Computational Challenges for Integrative Genomics  

Kim, Junhyong (Department of Biology, Penn Center for Bioinformatics, University of Pennsylvania)
Magwene, Paul (Department of Computer and Information Science, Penn Center for Bioinformatics, University of Pennsylvania)
Abstract
Integrated genomics refers to the use of large-scale, systematically collected data from various sources to address biological and biomedical problems. A critical ingredient of a successful research program in integrated genomics is an effective computational infrastructure. In this review, we suggest that the computational infrastructure challenges include developing tools for organizing and accessing heterogeneous data, innovating techniques for combining the results of different analyses, and establishing a theoretical framework for integrating biological and quantitative models. For each of the three areas (data integration, analysis integration, and model integration), we review some of the current progress and suggest new topics of research. We argue that the primary computational challenges lie in developing sound theoretical foundations for understanding the genome rather than simply in developing algorithms and programs.
Keywords
Integrative genomics; computational biology; bioinformatics; probabilistic modeling