A Heuristic Algorithm to Find All Normalized Local Alignments Above Threshold

  • Kim, Sangtae (Department of Computer Science, Korea Military Academy)
  • Sim, Jeong Seop (Electronics and Telecommunications Research Institute)
  • Park, Heejin (School of Computer Science and Engineering, Seoul National University)
  • Park, Kunsoo (Electronics and Telecommunications Research Institute, School of Computer Science and Engineering, Seoul National University)
  • Park, Hyunseok (Institute of Bioinformatics, Macrogen, Inc., Department of Computer Science, Ewha Womans University)
  • Seo, Jeong-Sun (Institute of Bioinformatics, Macrogen, Inc., Ilcheon Molecular Medicine Institute, Seoul National University)
  • Published : 2003.09.01

Abstract

Local alignment is an important task in molecular biology: it determines whether two sequences contain similar regions. The most popular approach to local alignment is the dynamic programming algorithm of Smith and Waterman, but the alignments it reports have some undesirable properties. A recent approach to fixing these problems is the notion of normalized scores for local alignments, introduced by Arslan, Eğecioğlu and Pevzner. In this paper we consider the problem of finding all local alignments whose normalized scores are above a given threshold, and we present a fast heuristic algorithm for it. Our algorithm is 180-330 times faster than that of Arslan et al. for sequences of length about 120 kbp, and about 40-50 times faster for sequences of length about 30 kbp.
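
To make the two notions in the abstract concrete, the following is a minimal Python sketch of Smith-Waterman local alignment together with a normalized score. It is not the heuristic algorithm of this paper. The scoring values (match, mismatch, gap), the linear gap penalty, and the function names are assumptions chosen for illustration, and the normalized score is taken to be S / (|a| + |b| + L) for an alignment of score S between segments a and b with a constant L, which is how the normalization of Arslan et al. is understood here.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, (a_start, a_end), (b_start, b_end)) for one best
    local alignment; segment bounds are half-open Python slice indices.
    Scoring values are illustrative assumptions, not taken from the paper."""
    n, m = len(a), len(b)
    # H[i][j]: best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    # start[i][j]: cell (i0, j0) where that best alignment begins
    start = [[(i, j) for j in range(m + 1)] for i in range(n + 1)]
    best = (0, (0, 0), (0, 0))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (0, (i, j)),                              # start a fresh alignment
                (H[i - 1][j - 1] + s, start[i - 1][j - 1]),  # match or mismatch
                (H[i - 1][j] + gap, start[i - 1][j]),        # gap in b
                (H[i][j - 1] + gap, start[i][j - 1]),        # gap in a
            ]
            H[i][j], start[i][j] = max(candidates, key=lambda c: c[0])
            if H[i][j] > best[0]:
                i0, j0 = start[i][j]
                best = (H[i][j], (i0, i), (j0, j))
    return best


def normalized_score(score, len_a, len_b, L):
    """Normalized score S / (|a| + |b| + L); the form is an assumption."""
    return score / (len_a + len_b + L)


if __name__ == "__main__":
    score, (a0, a1), (b0, b1) = smith_waterman("XXACGTXX", "YYYACGTY")
    print(score, (a0, a1), (b0, b1))
    print(normalized_score(score, a1 - a0, b1 - b0, L=2))
```

Dividing by the segment lengths penalizes long alignments that accumulate a high raw score while containing poorly matching stretches; the constant L keeps very short alignments from dominating.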

References

  1. Alexandrov, N.N., and Solovyev, V.V. (1998). Statistical significance of ungapped alignments, Pacific Symposium on Biocomputing '98, 463-472
  2. Arslan, A.N., and Eğecioğlu, Ö. (1999). An efficient uniform-cost normalized edit distance algorithm, Symposium on String Processing and Information Retrieval '99, IEEE Computer Society, 8-15
  3. Arslan, A.N., and Eğecioğlu, Ö. (2003). Efficient algorithms for normalized edit distances, Journal of Discrete Algorithms, Hermes Science Publications, in press
  4. Arslan, A.N., Eğecioğlu, Ö., and Pevzner, P. (2001). A new approach to sequence comparison: normalized sequence alignment, Bioinformatics 17, 327-337 https://doi.org/10.1093/bioinformatics/17.4.327
  5. Chen, T., and Skiena, S.S. (1997). Trie-based data structures for sequence assembly, Combinatorial Pattern Matching '97, 206-223
  6. Dinkelbach, W., (1967). On nonlinear fractional programming, Management Science 13, 492-498 https://doi.org/10.1287/mnsc.13.7.492
  7. Eğecioğlu, Ö., and Ibel, M. (1996). Parallel algorithms for fast computation of normalized edit distances, IEEE Symposium on Parallel and Distributed Processing '96, 496-503
  8. Gotoh, O. (1982). An improved algorithm for matching biological sequences, Journal of Molecular Biology 162, 705-708 https://doi.org/10.1016/0022-2836(82)90398-9
  9. Goad, W.B., and Kanehisa, M.I. (1982). Pattern recognition in nucleic acid sequences. I. A general method for finding local homologies and symmetries, Nucleic Acids Research 10, 247-263 https://doi.org/10.1093/nar/10.1.247
  10. Green, P., Documentation for phrap, Genome Center, University of Washington, http://www.phrap.org/phrap.docs/phrap.html
  11. Gusfield, D. (1997). Algorithms on Strings, Trees, and Sequences, Cambridge University Press
  12. Lipman, D., and Pearson, W. (1988). Improved tools for biological sequence comparison, Proceedings of the National Academy of Sciences 85, 2444-2448 https://doi.org/10.1073/pnas.85.8.2444
  13. Marzal, A., and Vidal, E. (1993). Computation of normalized edit distances and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence 15, 926-932 https://doi.org/10.1109/34.232078
  14. Sellers, P.H. (1984). Pattern recognition in genetic sequences by mismatch density, Bulletin of Mathematical Biology 46, 501-504 https://doi.org/10.1007/BF02459499
  15. Setubal, J., and Meidanis, J. (1997). Introduction to Computational Molecular Biology, PWS Publishing Company
  16. Smith, T.F., and Waterman, M.S. (1981). Identification of common molecular subsequences, Journal of Molecular Biology 147, 195-197 https://doi.org/10.1016/0022-2836(81)90087-5
  17. Waterman, M.S., (1995). Introduction to Computational Biology, Chapman & Hall, London
  18. Zhang, Z., Berman, P., and Miller, W. (1998). Alignments without low scoring regions, Journal of Computational Biology 5, 197-200 https://doi.org/10.1089/cmb.1998.5.197
  19. Zhang, Z., Berman, P., Wiehe, T., and Miller, W. (1999). Post-processing long pairwise alignments, Bioinformatics 15, 1012-1019 https://doi.org/10.1093/bioinformatics/15.12.1012