http://dx.doi.org/10.3745/KTSDE.2013.2.2.123

Genome Analysis Pipeline I/O Workload Analysis  

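Lim, Kyeongyeol (Department of Electronics, Computer and Communication, Hanyang University)
Kim, Dongoh (Electronics and Telecommunications Research Institute)
Kim, Hongyeon (Electronics and Telecommunications Research Institute)
Park, Geehan (Department of Electronics, Computer and Communication, Hanyang University)
Choi, Minseok (Department of Electronics, Computer and Communication, Hanyang University)
Won, Youjip (Department of Electronics, Computer and Communication Engineering, Hanyang University)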
Publication Information
KIPS Transactions on Software and Data Engineering / v.2, no.2, 2013, pp. 123-130
Abstract
As the volume of genomic data grows rapidly, so does the need for high-performance computing systems to process and store it. In this paper, we capture the I/O trace of a system that analyzed 500 million sequence reads in a genome analysis pipeline over 86 hours. The workload created 630 files totaling 1031.7 GByte and deleted 535 files totaling 91.4 GByte. Notably, 80% of all accesses went to only two of the 654 files in the system. Read requests were larger than 512 KByte, and write requests were larger than 1 MByte. Most read operations were random, while most write operations were sequential. Throughput and bandwidth differed across the processing phases.
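The random-versus-sequential distinction in the abstract can be illustrated with a minimal sketch. It is not the authors' tooling: the record layout `(op, path, offset, size)`, the function name `classify_trace`, the file names, and the rule that a request is sequential when it starts exactly where the previous request of the same type on the same file ended are all assumptions made for illustration.

```python
from collections import defaultdict


def classify_trace(records):
    """Count sequential vs. random requests per operation type.

    records: iterable of (op, path, offset, size), where op is "R" or "W".
    Assumption: a request is sequential if it begins at the byte where the
    previous request of the same op on the same file ended. The first
    request to a file is counted as sequential.
    """
    next_offset = {}  # (op, path) -> offset expected for a sequential request
    counts = defaultdict(lambda: {"sequential": 0, "random": 0})
    for op, path, offset, size in records:
        key = (op, path)
        expected = next_offset.get(key)
        if expected is None or offset == expected:
            counts[op]["sequential"] += 1
        else:
            counts[op]["random"] += 1
        next_offset[key] = offset + size
    return {op: dict(c) for op, c in counts.items()}


# Hypothetical trace: 1 MByte appends to one file, 512 KByte scattered reads
# from another. File names and offsets are invented for the example.
trace = [
    ("W", "out.bam", 0, 1048576),
    ("W", "out.bam", 1048576, 1048576),
    ("R", "ref.fa", 4096000, 524288),
    ("R", "ref.fa", 0, 524288),
    ("R", "ref.fa", 9000000, 524288),
]
result = classify_trace(trace)
print(result)
```

Tracking the expected offset per file and per operation type keeps interleaved accesses to different files from being misread as random. A real analysis would likely also tolerate small gaps between requests, which this sketch does not.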
Keywords
Bioinformatics; Workload Analysis; SSD;