Building a Classifier for Integrated Microarray Datasets through a Two-Stage Approach

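  • Youngmi Yoon (Department of Computer Science, Yonsei University)
  • Jongchan Lee (Department of Computer Science, Yonsei University)
  • Sanghyun Park (Department of Computer Science, Yonsei University)
  • Published: 2007.02.15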

Abstract

Since microarray data capture tens of thousands of gene expression values simultaneously, they could be very useful in identifying the phenotypes of diseases. However, analyses of several microarray datasets that were generated independently with the same biological objective can yield different results. One of the main reasons is the limited number of samples involved in a single microarray experiment. To increase classification accuracy, it is desirable to enlarge the sample size by integrating independently conducted microarray datasets and making full use of them. In this paper, we propose a novel two-stage approach. The first stage integrates the individual microarray datasets, to overcome the problem caused by the limited number of samples, and identifies informative genes. The second stage builds a classifier using only the informative genes. The classifier built from the larger sample obtained by integrating independent microarray datasets achieves high accuracy (up to 24.19% higher than the comparison methods), sensitivity, and specificity on an independent test dataset.

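Because microarray data contain the expression values of tens of thousands of genes simultaneously, they are very useful for classifying disease phenotypes. However, even for the same biological subject, the analysis results of microarrays generated by independent research groups can differ from one another. The main reason is that the number of samples participating in a single microarray experiment is limited. Therefore, integrating independently conducted microarray datasets to increase the number of samples is very important for more accurate analysis. As a solution, this study proposes a two-stage approach. In the first stage, independently generated microarray datasets on the same subject are integrated and informative genes are extracted; in the second stage, a classifier is derived through classification using only the informative genes. Experimental results from applying this classifier to separate test sample data show that, as the number of samples is increased by integrating microarray data, the resulting classifier achieves up to 24.19% higher accuracy than the comparison methods.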
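
The two-stage flow described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how datasets are integrated, how genes are scored, or which classifier is used, so every concrete choice below is an assumption made for illustration only: per-dataset z-scoring as the integration step, a signal-to-noise style gene score, and a nearest-centroid classifier. All function names and the toy expression values are hypothetical.

```python
from statistics import mean, pstdev

def zscore_per_gene(samples):
    """Standardize each gene (column) within a single dataset (assumed integration step)."""
    cols = list(zip(*samples))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant genes
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in samples]

def integrate(datasets):
    """Stage 1a: pool independently generated datasets after standardization."""
    X, y = [], []
    for samples, labels in datasets:
        X.extend(zscore_per_gene(samples))
        y.extend(labels)
    return X, y

def select_informative(X, y, k):
    """Stage 1b: keep the k genes with the largest |mean0 - mean1| / (sd0 + sd1)."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        denom = (pstdev(a) + pstdev(b)) or 1e-9
        scores.append((abs(mean(a) - mean(b)) / denom, g))
    return sorted(g for _, g in sorted(scores, reverse=True)[:k])

def train_centroids(X, y, genes):
    """Stage 2: build a nearest-centroid classifier on the informative genes only."""
    cents = {}
    for lab in set(y):
        rows = [[row[g] for g in genes] for row, l in zip(X, y) if l == lab]
        cents[lab] = [mean(c) for c in zip(*rows)]
    return cents

def predict(cents, genes, sample):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    v = [sample[g] for g in genes]
    return min(cents, key=lambda lab: sum((a - b) ** 2 for a, b in zip(v, cents[lab])))

# Toy data (hypothetical): two studies on different scales, 3 genes, labels 0/1.
study_a = ([[1.0, 5.0, 3.0], [1.2, 4.0, 3.1], [3.0, 5.1, 3.1], [3.2, 4.1, 3.0]],
           [0, 0, 1, 1])
study_b = ([[10.0, 50.0, 7.0], [11.0, 70.0, 7.2], [30.0, 52.0, 7.2], [31.0, 69.0, 7.0]],
           [0, 0, 1, 1])

X, y = integrate([study_a, study_b])
genes = select_informative(X, y, k=1)
cents = train_centroids(X, y, genes)

# An independent test batch is standardized the same way before prediction.
# This assumes the batch contains samples from both classes.
test_batch = zscore_per_gene([[100.0, 9.0, 1.0], [300.0, 8.0, 1.1]])
preds = [predict(cents, genes, s) for s in test_batch]
print(genes, preds)
```

Standardizing each study separately before pooling is only one simple way to make values from different experiments comparable; the method proposed in the paper may differ in each of these steps.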
