FASIM: Fragments Assembly Simulation using Biased-Sampling Model and Assembly Simulation for Microbial Genome Shotgun Sequencing

  • Hur Cheol-Goo (Division of Genomics and Proteomics, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Bioinformatics Cooperative Course, Pusan National University)
  • Kim Sunny (Division of Genomics and Proteomics, Korea Research Institute of Bioscience and Biotechnology (KRIBB))
  • Kim Chang-Hoon (Division of Genomics and Proteomics, Korea Research Institute of Bioscience and Biotechnology (KRIBB))
  • Yoon Sung-Ho (Division of Genomics and Proteomics, Korea Research Institute of Bioscience and Biotechnology (KRIBB))
  • In Yong-Ho (Bioinfomatix, Inc.)
  • Kim Cheol-Min (Bioinformatics Cooperative Course, Pusan National University)
  • Cho Hwan-Gue (Bioinformatics Cooperative Course, Pusan National University)
  • Published: 2006.05.01

Abstract

We have developed a program that generates shotgun data sets from known genome sequences. Synthetic data sets generated by a computer program are a useful alternative to real data, to which students and researchers often have limited access. Previous programs sampled clones from a uniform distribution, which cannot account for the real situation in which sampled reads tend to come from particular regions of the target genome. To reflect this situation, we developed a probabilistic model of the biased sampling distribution, using an experimental data set derived from a microbial genome project. Among the experimental parameters tested (fragment length, read length, chimerism, and sequencing error), the extent of sequencing error was the most critical factor hampering sequence assembly. We propose that an optimum sequencing strategy, employing different insert lengths and levels of redundancy, can be established by performing a variety of simulations.
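The contrast between uniform and biased sampling described above can be illustrated with a minimal sketch. This is not FASIM's implementation: the abstract does not give the probabilistic model, so the function name `sample_reads`, the per-position `weights` argument, and the substitution-only error model are all assumptions made for illustration.

```python
import random


def sample_reads(genome, n_reads, read_len, weights=None, error_rate=0.0, seed=0):
    """Draw reads from a known genome (hypothetical helper, not FASIM's API).

    weights: optional per-start-position weights. None means uniform sampling;
             unequal weights give a biased sampling distribution.
    error_rate: per-base probability of a substitution error (assumed model).
    Returns a list of (start_position, read_sequence) tuples.
    """
    n_starts = len(genome) - read_len + 1
    if n_starts <= 0:
        raise ValueError("read_len exceeds genome length")
    if weights is not None and len(weights) != n_starts:
        raise ValueError("weights must have one entry per start position")

    rng = random.Random(seed)
    # random.choices samples uniformly when weights is None.
    starts = rng.choices(range(n_starts), weights=weights, k=n_reads)

    reads = []
    for start in starts:
        bases = list(genome[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute with a different base to mimic a sequencing error.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads


if __name__ == "__main__":
    demo_rng = random.Random(42)
    genome = "".join(demo_rng.choice("ACGT") for _ in range(1000))
    n_starts = len(genome) - 50 + 1

    uniform = sample_reads(genome, 200, 50)
    # Hypothetical bias: start positions in the first half are 5x more likely.
    bias = [5.0 if i < n_starts // 2 else 1.0 for i in range(n_starts)]
    biased = sample_reads(genome, 200, 50, weights=bias, error_rate=0.01)

    for name, reads in (("uniform", uniform), ("biased", biased)):
        first_half = sum(1 for start, _ in reads if start < n_starts // 2)
        print(name, "reads starting in first half:", first_half, "of", len(reads))
```

In a real simulation the weights would be estimated from experimental read positions, as the abstract indicates, rather than set by hand; chimerism and variable fragment lengths are left out of this sketch.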
