Computational Challenges for Integrative Genomics

  • Kim, Junhyong (Department of Biology, Penn Center for Bioinformatics, University of Pennsylvania)
  • Magwene, Paul (Department of Computer and Information Science, Penn Center for Bioinformatics, University of Pennsylvania)
  • Published : 2004.03.01

Abstract

Integrated genomics refers to the use of large-scale, systematically collected data from various sources to address biological and biomedical problems. A critical ingredient of a successful research program in integrated genomics is an effective computational infrastructure. In this review, we suggest that the computational infrastructure challenges include developing tools for organizing and accessing heterogeneous data, devising techniques for combining the results of different analyses, and establishing a theoretical framework for integrating biological and quantitative models. For each of the three areas (data integration, analysis integration, and model integration) we review some of the current progress and suggest new topics of research. We argue that the primary computational challenges lie in developing sound theoretical foundations for understanding the genome rather than simply in developing algorithms and programs.
