Significant Gene Selection Using Integrated Microarray Data Set with Batch Effect

  • Kim Ki-Yeol (Oral Cancer Research Institute, Yonsei University College of Dentistry) ;
  • Chung Hyun-Cheol (Department of Internal Medicine, Yonsei University College of Medicine, Brain Korea 21 Project for Medical Science, Yonsei University College of Medicine, Cancer Metastasis Research Center, Yonsei University College of Medicine, Yonsei Cancer Center, Yonsei University College of Medicine) ;
  • Jeung Hei-Cheul (Cancer Metastasis Research Center, Yonsei University College of Medicine) ;
  • Shin Ji-Hye (Cancer Metastasis Research Center, Yonsei University College of Medicine) ;
  • Kim Tae-Soo (Cancer Metastasis Research Center, Yonsei University College of Medicine, Yonsei Cancer Center, Yonsei University College of Medicine) ;
  • Rha Sun-Young (Brain Korea 21 Project for Medical Science, Yonsei University College of Medicine, Cancer Metastasis Research Center, Yonsei University College of Medicine)
  • Published : 2006.09.01

Abstract

In microarray technology, many experimental factors can introduce bias, including RNA source, microarray production or platform differences, sample processing, and experimental protocol. These systematic effects are a substantial obstacle to the analysis of microarray data. When data sets derived from different experimental processes are analyzed together, the results are mostly inconsistent and unreliable. Therefore, one of the most pressing challenges in the microarray field is how to combine data that come from two different groups. As a novel attempt to integrate two data sets with a batch effect, we simply applied standardization to the microarray data before significant gene selection. In the gene selection step, we used a newly defined measure that considers the distance between a gene and an ideal gene as well as the between-slide and within-slide variations. We also discuss the association between biological functions and the different expression patterns in the selected discriminative gene set. As a result, we confirmed that the batch effect was minimized by standardization and that the genes selected from the standardized data included various expression patterns and significant biological functions.
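
The standardization step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the standardization was computed, so the code below assumes one common choice: each gene is z-scored across the arrays of its own batch before the batches are joined. The function names `standardize_batch` and `combine_batches` and the toy data are hypothetical, and the gene selection measure is not reproduced because the abstract does not define it.

```python
from statistics import mean, pstdev


def standardize_batch(batch):
    """Z-score each gene (row) across the arrays (columns) of one batch.

    Assumption: per-gene, per-batch standardization. `batch` is a list of
    rows, one row per gene, one value per array. Genes with zero spread
    are mapped to all zeros to avoid division by zero.
    """
    out = []
    for row in batch:
        mu = mean(row)
        sd = pstdev(row)  # population standard deviation
        if sd == 0:
            out.append([0.0 for _ in row])
        else:
            out.append([(x - mu) / sd for x in row])
    return out


def combine_batches(batches):
    """Standardize each batch separately, then join arrays gene by gene."""
    standardized = [standardize_batch(b) for b in batches]
    n_genes = len(standardized[0])
    return [
        [x for b in standardized for x in b[g]]
        for g in range(n_genes)
    ]


# Toy data (hypothetical): 2 genes, 3 arrays per batch.
# Batch B carries a constant offset of +5 relative to batch A.
batch_a = [[1.0, 2.0, 3.0], [4.0, 4.5, 5.0]]
batch_b = [[6.0, 7.0, 8.0], [9.0, 9.5, 10.0]]
combined = combine_batches([batch_a, batch_b])
```

In this toy case the constant offset between the two batches disappears after standardization, so each gene has the same profile in both halves of `combined`. Standardizing within batches also assumes that the batches have a comparable sample composition; if they do not, real biological differences between them would be removed together with the batch effect.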
