Searching for Optimal Ensemble of Feature-classifier Pairs in Gene Expression Profile using Genetic Algorithm

유전알고리즘을 이용한 유전자발현 데이타상의 특징-분류기쌍 최적 앙상블 탐색

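  • 박찬호 (Chan-Ho Park; Department of Computer Science, Yonsei University)
  • 조성배 (Sung-Bae Cho; Department of Computer Science, Yonsei University)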
  • Published : 2004.04.01

Abstract

A gene expression profile is numerical data on the expression levels of an organism's genes, measured on a microarray. Normal and abnormal tissues generally show different expression levels in the related genes, so diseases can be classified from gene expression profiles. Because not all genes are related to a given disease, two steps are needed: feature selection, which picks out the related genes, and a classifier that handles the selected genes properly. This paper proposes a genetic algorithm (GA) based method for searching for the optimal ensemble of feature-classifier pairs, where the pairs are composed from seven feature selection methods based on correlation, similarity, and information theory, and six representative classifiers. In experiments with leave-one-out cross-validation on two cancer-related gene expression profiles, the Lymphoma dataset and the Colon dataset, we find ensembles that perform much better than every individual feature-classifier pair.

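The search described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the chromosome encoding, fitness function, or genetic operators, so the bit-mask encoding (one bit per feature-classifier pair), the majority-vote fitness, the tournament selection, the one-point crossover, and all parameter values are assumptions. The predictions are synthetic placeholders standing in for leave-one-out outputs of the 7 x 6 = 42 pairs; they are not results from the paper.

```python
import random

N_FEATURE_METHODS = 7   # from the abstract
N_CLASSIFIERS = 6       # from the abstract
N_PAIRS = N_FEATURE_METHODS * N_CLASSIFIERS  # 42 candidate pairs


def ensemble_accuracy(mask, predictions, labels):
    """Majority-vote accuracy of the pairs switched on in `mask`.

    predictions[p][i] is the class (0 or 1) that pair p assigned to
    sample i when that sample was held out (leave-one-out).
    """
    members = [p for p, bit in enumerate(mask) if bit]
    if not members:
        return 0.0  # an empty ensemble cannot classify anything
    correct = 0
    for i, label in enumerate(labels):
        votes = sum(predictions[p][i] for p in members)
        # Assumption: ties in even-sized ensembles go to class 0.
        vote = 1 if 2 * votes > len(members) else 0
        correct += (vote == label)
    return correct / len(labels)


def search_ensemble(predictions, labels, pop_size=30, generations=40,
                    mutation_rate=0.02, seed=0):
    """Simple generational GA over bit masks (hypothetical settings)."""
    rng = random.Random(seed)
    n = len(predictions)

    def fitness(mask):
        return ensemble_accuracy(mask, predictions, labels)

    population = [[rng.randint(0, 1) for _ in range(n)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        next_pop = [best[:]]  # elitism: carry the best mask forward
        while len(next_pop) < pop_size:
            # Tournament selection of size 2 for each parent.
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            cut = rng.randrange(1, n)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]
            next_pop.append(child)
        population = next_pop
        best = max(population, key=fitness)
    return best, fitness(best)


def synthetic_predictions(n_pairs, labels, error_rate, rng):
    """Placeholder data: each pair flips the true label with `error_rate`."""
    return [[1 - y if rng.random() < error_rate else y for y in labels]
            for _ in range(n_pairs)]


rng = random.Random(1)
labels = [rng.randint(0, 1) for _ in range(40)]
predictions = synthetic_predictions(N_PAIRS, labels, 0.3, rng)
best_mask, best_acc = search_ensemble(predictions, labels)
print(sum(best_mask), "pairs selected, accuracy", best_acc)
```

In practice the `predictions` table would be filled by running each feature-classifier pair under leave-one-out cross-validation. Note that scoring ensembles on the same held-out predictions that drive the search tends to give an optimistic accuracy estimate, so a separate validation split would be a reasonable addition.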
