http://dx.doi.org/10.5808/GI.2011.9.1.019

Standard-based Integration of Heterogeneous Large-scale DNA Microarray Data for Improving Reusability  

Jung, Yong (Seoul National University Biomedical Informatics)
Seo, Hwa-Jeong (Medical Informatics, Graduate School of Public Health, Gachon University of Medicine and Science)
Park, Yu-Rang (Seoul National University Biomedical Informatics)
Kim, Ji-Hun (Seoul National University Biomedical Informatics)
Bien, Sang Jay (Seoul National University Biomedical Informatics)
Kim, Ju-Han (Seoul National University Biomedical Informatics)
Abstract
The Gene Expression Omnibus (GEO) holds the largest collection of gene-expression microarray data, and that collection has grown exponentially. Microarray data in GEO have been generated in many different formats and often lack standardized annotation and documentation. It is hard to tell whether a dataset has been preprocessed and, if so, how. Standard-based integration of heterogeneous data formats and metadata is necessary for comprehensive data query, analysis, and mining. We attempted to integrate the heterogeneous microarray data in GEO based on the Minimum Information About a Microarray Experiment (MIAME) standard. We unified the data fields of the GEO Data table and mapped the attributes of GEO metadata onto MIAME elements. We also distinguished non-preprocessed raw datasets from processed and other datasets using a two-step classification method. Most of the procedures were developed as semi-automated algorithms that rely in part on text mining techniques. We localized 2,967 Platforms, 4,867 Series, and 103,590 Samples covering 279 organisms, integrated them into a standard-based relational schema, and developed a comprehensive query interface for extracting them. Our tool, GEOQuest, is available at http://www.snubi.org/software/GEOQuest/.
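The mapping of GEO metadata attributes onto MIAME elements can be illustrated with a minimal Python sketch. It is not the GEOQuest implementation. The attribute names follow GEO's SOFT format and the section names follow the six parts of MIAME, but the assignment of attributes to sections, the function names `parse_soft_attributes` and `to_miame`, and the placeholder accession `GSM0000` are assumptions made for illustration only.

```python
# Illustrative only: a hypothetical assignment of GEO SOFT attributes
# to MIAME sections. The paper's actual mapping is not reproduced here.
GEO_TO_MIAME = {
    "Series_summary": "Experimental design",
    "Series_overall_design": "Experimental design",
    "Platform_manufacturer": "Array design",
    "Platform_technology": "Array design",
    "Sample_source_name_ch1": "Samples",
    "Sample_organism_ch1": "Samples",
    "Sample_extract_protocol_ch1": "Samples",
    "Sample_label_ch1": "Samples",
    "Sample_hyb_protocol": "Hybridizations",
    "Sample_scan_protocol": "Measurements",
    "Sample_data_processing": "Measurements",
}


def parse_soft_attributes(text):
    """Collect '!Key = value' lines of SOFT-formatted text into a dict of lists."""
    attrs = {}
    for line in text.splitlines():
        # Attribute lines start with '!'; entity ('^') and table lines are skipped.
        if not line.startswith("!") or "=" not in line:
            continue
        key, value = line[1:].split("=", 1)
        attrs.setdefault(key.strip(), []).append(value.strip())
    return attrs


def to_miame(attrs):
    """Group parsed attributes by MIAME section; unknown keys go to 'Unmapped'."""
    grouped = {}
    for key, values in attrs.items():
        section = GEO_TO_MIAME.get(key, "Unmapped")
        grouped.setdefault(section, {})[key] = values
    return grouped


# Placeholder record; GSM0000 is not a real accession.
example = """^SAMPLE = GSM0000
!Sample_title = example sample
!Sample_organism_ch1 = Homo sapiens
!Sample_data_processing = RMA normalization
"""

if __name__ == "__main__":
    for section, fields in to_miame(parse_soft_attributes(example)).items():
        print(section, fields)
```

Keeping an explicit "Unmapped" group makes it visible which attributes still need manual curation, which fits the semi-automated character of the procedures described above.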
Keywords
gene expression data; data integration; classification