http://dx.doi.org/10.12673/jant.2014.18.4.401

Processing Method of Mass Small File Using Hadoop Platform  

Kim, Chang-Bok (Department of Energy IT, Gachon University)
Chung, Jae-Pil (Department of Electronic Engineering, Gachon University)
Abstract
Hadoop consists of the MapReduce programming model for distributed processing and the HDFS distributed file system. Hadoop is a suitable framework for big data processing, but processing large numbers of small files raises several problems. One mapper is created per file, and the namenode needs a large amount of memory to store the metadata of every file. This paper compares and evaluates several methods for processing large numbers of small files on the Hadoop platform. General compression formats are inadequate because a compressed file is processed by a single mapper regardless of its data size. Sequence files and Hadoop archive files remove the namenode memory problem by compressing and combining the small files. Hadoop archive files take less time than sequence files to combine the small files. Processing with the CombineFileInputFormat class requires no prior combining of the small files and runs at a speed similar to that of ordinary big data processing.
Keywords
Big data; CombineFileInputFormat; Hadoop distributed file system; MapReduce; Small file;
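
The two problems named in the abstract, one mapper per file and namenode memory for per-file metadata, can be illustrated with a small stand-alone sketch. It is not the authors' code and it does not call Hadoop. The function names are hypothetical, the greedy packing only approximates what CombineFileInputFormat does (the real class also takes node and rack locality into account), and the 150 bytes per namenode object is a commonly quoted rule of thumb, not a figure from this paper.

```python
from typing import List


def splits_one_per_file(file_sizes: List[int]) -> int:
    """Files smaller than one block each yield their own input split,
    so the number of mappers equals the number of files."""
    return len(file_sizes)


def combine_splits(file_sizes: List[int], max_split_bytes: int) -> List[List[int]]:
    """Greedily pack whole files into splits of at most max_split_bytes.

    Simplified stand-in for CombineFileInputFormat: locality is ignored,
    and a single file larger than the limit still gets its own split.
    """
    splits: List[List[int]] = []
    current: List[int] = []
    current_bytes = 0
    for size in file_sizes:
        # Close the current split when the next file would overflow it.
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current = []
            current_bytes = 0
        current.append(size)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


def namenode_bytes(num_files: int, bytes_per_object: int = 150) -> int:
    """Rough namenode memory estimate (assumption: about 150 bytes per
    object, and each small file costs one file object plus one block object)."""
    return num_files * 2 * bytes_per_object


if __name__ == "__main__":
    files = [100 * 1024] * 10_000            # 10,000 files of 100 KB each
    max_split = 128 * 1024 * 1024            # assumed 128 MB split limit
    print(splits_one_per_file(files))        # 10000 mappers
    print(len(combine_splits(files, max_split)))  # 8 mappers
    print(namenode_bytes(10_000_000))        # 3000000000 bytes, about 3 GB
```

Under these assumptions, packing reduces 10,000 map tasks to 8 without rewriting the input files, which is consistent with the abstract's observation that CombineFileInputFormat needs no separate combining step. It does not reduce the number of objects the namenode must track; that is the problem sequence files and Hadoop archive files address.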