http://dx.doi.org/10.3745/KIPSTD.2009.16D.5.661

A CNV detection algorithm based on statistical analysis of the aligned reads  

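Hong, Sang-Kyoon (Department of Computer Engineering, Hallym University)
Hong, Dong-Wan (Department of Biomedical Science, Hallym University)
Yoon, Jee-Hee (Department of Computer Engineering, Hallym University)
Kim, Baek-Sop (Department of Computer Engineering, Hallym University)
Park, Sang-Hyun (Department of Computer Science, Yonsei University)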
Abstract
Recent studies have found that the human genome contains various structural variations, such as copy number variation (CNV), and that these variations are closely related to disease susceptibility, response to treatment, and genetic characteristics. In this paper we propose a new CNV detection algorithm that uses millions of short DNA sequences (reads) generated by giga-sequencing technology. Our method maps the reads onto the reference sequence and obtains the occurrence frequency of each read in the reference sequence. It then analyzes the distribution of these occurrence frequencies and reports statistically significant regions longer than 1 kbp as candidate CNV regions. To select a suitable read alignment method, we employ several alignment methods in our algorithm and compare their performance. We evaluated the approach in extensive experiments. Simulation experiments using a reference sequence (NCBI build 35) showed that our approach successfully finds all the CNV regions, which have various shapes and arbitrary lengths (small, intermediate, or large).
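The pipeline described in the abstract (map reads, count occurrences along the reference, keep significant regions longer than 1 kbp) can be illustrated with a minimal Python sketch. The abstract does not state which statistical test the authors use, so the z-score on fixed-size window counts, the window size, the cutoff, and the function name `candidate_cnv_regions` are all assumptions made for illustration. The input is assumed to be a list of 0-based read start positions already produced by some aligner.

```python
from statistics import mean, pstdev


def candidate_cnv_regions(read_starts, ref_length, window=100,
                          z_cutoff=3.0, min_length=1000):
    """Return (start, end, kind) regions whose read depth deviates strongly.

    Assumptions (not taken from the paper): fixed-size windows, a z-score
    against the genome-wide mean window count, and a cutoff of 3.
    """
    # Count mapped read starts per window along the reference.
    n_windows = (ref_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < ref_length:
            counts[pos // window] += 1

    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        return []  # perfectly uniform coverage: nothing stands out

    def label(count):
        # Classify one window as a gain, a loss, or unremarkable.
        z = (count - mu) / sigma
        if z >= z_cutoff:
            return "gain"
        if z <= -z_cutoff:
            return "loss"
        return None

    # Merge consecutive windows with the same label, then keep only
    # runs that span at least min_length bases (1 kbp by default).
    regions = []
    run_start, run_kind = 0, None
    for i in range(n_windows + 1):
        kind = label(counts[i]) if i < n_windows else None
        if kind != run_kind:
            if run_kind is not None:
                start = run_start * window
                end = min(i * window, ref_length)
                if end - start >= min_length:
                    regions.append((start, end, run_kind))
            run_start, run_kind = i, kind
    return regions
```

For example, with one read start every 10 bp over a 100 kbp reference, doubling the reads in a 2 kbp stretch yields a single "gain" region, while a doubled stretch of only 500 bp is discarded by the length filter. A real implementation would also need to handle repeats, sequencing errors, and multiply mapped reads, which this sketch ignores.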
Keywords
Copy Number Variation (CNV); Giga-Sequencing; Sequence Alignment; Statistical Significance