[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.3745/KTSDE.2013.2.2.123

Genome Analysis Pipeline I/O Workload Analysis

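Lim, Kyeongyeol (Hanyang University, Department of Electronics, Computer and Communication)
Kim, Dongoh (Electronics and Telecommunications Research Institute)
Kim, Hongyeon (Electronics and Telecommunications Research Institute)
Park, Geehan (Hanyang University, Department of Electronics, Computer and Communication)
Choi, Minseok (Hanyang University, Department of Electronics, Computer and Communication)
Won, Youjip (Hanyang University, Department of Electronics, Computer and Communication Engineering)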

Publication Information

KIPS Transactions on Software and Data Engineering / v.2, no.2, 2013, pp. 123-130

Abstract

As the size of genomic data grows rapidly, so does the need for high-performance computing systems to process and store it. In this paper, we captured the I/O trace of a system that analyzed 500 million sequence reads in a genome analysis pipeline for 86 hours. The workload created 630 files totaling 1031.7 GByte and deleted 535 files totaling 91.4 GByte. Interestingly, 80% of all accesses went to only two of the 654 files in the system. Read and write requests in the workload were larger than 512 KByte and 1 MByte, respectively. The majority of read operations showed a random pattern, while the majority of write operations showed a sequential pattern. The throughput and bandwidth observed differed from one processing phase to another.
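The two measurements the abstract highlights, the share of accesses going to the busiest files and the split between sequential and random requests, can be sketched in a few lines. This is only an illustration, not the authors' tooling: the record layout `(path, op, offset, size)`, the function name `summarize_trace`, and the sample file names are all assumptions made for the example. A request is counted as sequential when it starts exactly where the previous request of the same type on the same file ended.

```python
from collections import Counter, defaultdict


def summarize_trace(records, top_k=2):
    """Summarize an I/O trace.

    records: iterable of (path, op, offset, size) tuples.
    This layout is a hypothetical one chosen for the sketch.
    Returns (share of accesses going to the top_k files,
             {op: {"sequential": n, "random": m}}).
    """
    accesses = Counter()            # number of requests per file
    next_offset = {}                # (path, op) -> end offset of the previous request
    pattern = defaultdict(Counter)  # op -> sequential/random counts

    for path, op, offset, size in records:
        accesses[path] += 1
        key = (path, op)
        # The first request on a file has no predecessor, so it is not classified.
        if key in next_offset:
            kind = "sequential" if offset == next_offset[key] else "random"
            pattern[op][kind] += 1
        next_offset[key] = offset + size

    total = sum(accesses.values())
    top = sum(count for _, count in accesses.most_common(top_k))
    top_share = top / total if total else 0.0
    return top_share, {op: dict(counts) for op, counts in pattern.items()}


# Made-up records, for illustration only.
trace = [
    ("ref.fa", "read", 0, 4096),
    ("ref.fa", "read", 8192, 4096),
    ("ref.fa", "read", 4096, 4096),
    ("out.sam", "write", 0, 1024),
    ("out.sam", "write", 1024, 1024),
    ("tmp.idx", "read", 0, 512),
]
share, patterns = summarize_trace(trace)
print(f"top-2 files receive {share:.0%} of accesses")
print(patterns)
```

A real trace would also need per-phase grouping to reproduce the phase-by-phase throughput comparison, which this sketch leaves out.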

Keywords

Bioinformatics; Workload Analysis; SSD

References

1	C. Bell, R. Dixon, A. Farmer, R. Flores, J. Inman, R. Gonzales, M. Harrison, N. Paiva, A. Scott, J. Weller, et al., "The Medicago Genome Initiative: a model legume database," Nucleic Acids Research, Vol.29, No.1, pp.114-117, 2001.
2	L. Matukumalli, J. Grefenstette, D. Hyten, I. Choi, P. Cregan, and C. Van Tassell, "SNP-PHAGE - High throughput SNP discovery pipeline," BMC Bioinformatics, Vol.7, No.1, pp.468, 2006.
3	Seon-Hee Park, "IT based Bioinformatics," kiise, Vol.21, No.6, pp.20-26, 2003.
4	Ik-Young Choi, "A review of the technology of genome & expression analysis," TiBMB, Vol.30, No.2, pp.25-35, 2010.
5	E. Lander, L. Linton, B. Birren, C. Nusbaum, M. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, et al., "Initial sequencing and analysis of the human genome," Nature, Vol.409, No.6822, pp.860-921, 2001.
6	A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, et al., "The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data," Genome Research, Vol.20, No.9, pp.1297-1303, 2010.
7	H. Li and R. Durbin, "Fast and accurate short read alignment with Burrows-Wheeler transform," Bioinformatics, Vol.25, No.14, pp.1754-1760, 2009.
8	H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, et al., "The Sequence Alignment/Map format and SAMtools," Bioinformatics, Vol.25, No.16, pp.2078-2079, 2009.
9	FUSE, "Filesystem in userspace." http://fuse.sourceforge.net/.
10	J. Kang, H. Jo, J. Kim, and J. Lee, "A superblock-based flash translation layer for NAND flash memory," pp.161-170, 2006.
