[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.12673/jant.2014.18.4.401

Processing Method of Mass Small File Using Hadoop Platform

Kim, Chang-Bok (Department of Energy IT, Gachon University)
Chung, Jae-Pil (Department of Electronic Engineering, Gachon University)

Publication Information

Journal of Advanced Navigation Technology / v.18, no.4, 2014, pp. 401-408

Abstract

Hadoop consists of the MapReduce programming model for distributed processing and the HDFS distributed file system. Hadoop is a suitable framework for big data processing, but processing a large number of small files raises several problems. Hadoop creates one mapper per file, and the NameNode needs a large amount of memory to store the metadata of every file. This paper compares and evaluates several methods of processing a large number of small files on the Hadoop platform. General compression formats are inadequate because a compressed file is processed by a single mapper regardless of its data size. Sequence files and Hadoop archive files remove the NameNode memory problem by compressing and combining the small files. Hadoop archive files are faster than sequence files in the time needed to combine the small files. Processing with the CombineFileInputFormat class does not require the small files to be combined beforehand, and its speed is similar to that of conventional big data processing.
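
The two problems named in the abstract (one mapper per file, and NameNode memory growing with the number of files) can be illustrated with a small back-of-the-envelope model. The following Python sketch is not the paper's method and not Hadoop code. The 128 MiB block size, the 150 bytes per NameNode object (a commonly quoted rule of thumb), the greedy packing strategy, and all function names are assumptions made for illustration. The real CombineFileInputFormat also takes node and rack locality into account, which is ignored here.

```python
from typing import List

BLOCK_SIZE = 128 * 1024 * 1024   # assumed HDFS block size in bytes
BYTES_PER_OBJECT = 150           # assumed rule-of-thumb NameNode cost per file or block object


def blocks_for(size: int) -> int:
    """Number of blocks a file of `size` bytes occupies (at least one)."""
    return max(1, -(-size // BLOCK_SIZE))  # ceiling division


def mappers_one_per_file(file_sizes: List[int]) -> int:
    """Map tasks when every file is split on its own: one per block, at least one per file."""
    return sum(blocks_for(size) for size in file_sizes)


def combine_splits(file_sizes: List[int], max_split_size: int = BLOCK_SIZE) -> List[List[int]]:
    """Greedily pack files into splits of at most `max_split_size` bytes.

    A simplified stand-in for what a combining input format does;
    each returned split would be handled by one mapper.
    """
    splits: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in file_sizes:
        # Start a new split when adding this file would exceed the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


def namenode_metadata_bytes(file_sizes: List[int]) -> int:
    """Rough NameNode memory estimate: one object per file plus one per block."""
    objects = sum(1 + blocks_for(size) for size in file_sizes)
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    small_files = [1024 * 1024] * 10_000           # 10,000 files of 1 MiB each
    print("mappers, one per file:", mappers_one_per_file(small_files))
    print("mappers, combined splits:", len(combine_splits(small_files)))
    print("metadata bytes, separate files:", namenode_metadata_bytes(small_files))
    print("metadata bytes, one merged file:", namenode_metadata_bytes([sum(small_files)]))
```

Under these assumptions, 10,000 files of 1 MiB each need 10,000 map tasks when handled per file but only 79 combined splits. The metadata estimate shows why merging the files into a sequence file or Hadoop archive eases the NameNode memory pressure.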

Keywords

Big data; CombineFileInputFormat; Hadoop distributed file system; MapReduce; Small file

Citations & Related Records

Times Cited By KSCI: 2

Cited By KSCI

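9	Chul Woong Choi, "A Study on the Improving Performance of Massively Small File Using the Reuse JVM in MapReduce," Journal of Korea Multimedia Society, Vol. 18, No. 9, p. 1098, 2015.
10	"Design of a Method for Efficiently Processing Small Files in a Distributed System Based on the Hadoop Framework," The Journal of the Korea Institute of Electronic Communication Sciences, Vol. 10, No. 10, p. 1115, 2014.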
