Lasso Regression of RNA-Seq Data based on Bootstrapping for Robust Feature Selection

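  • Jeonghee Jo (Interdisciplinary Program in Bioinformatics, Seoul National University);
  • Sungroh Yoon (Department of Electrical and Computer Engineering, Seoul National University)
  • Received: 2017.04.21
  • Reviewed: 2017.06.29
  • Published: 2017.09.15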

Abstract

When lasso regression is applied to expression data from a large number of genes, the estimated regression coefficients can change from one run of the regression to the next, because the expression values of related genes are highly correlated. This instability in the coefficients, which L1 regularization shrinks toward zero, makes variable selection difficult. To address this problem, we repeat a bootstrapping step before lasso regression and build regression models from the genes that are selected with high frequency. Our experiments show that several feature genes were selected consistently across all of the regression models, and we verified that these genes were not false positives. We also measured the accuracy of each model's predicted index by its correlation with the actual index, and found that the sign distribution of the regression coefficients of the selected feature genes was related to this accuracy.
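The general idea of counting how often each gene survives lasso across bootstrap resamples can be sketched as follows. This is a minimal illustration and not the procedure used in the paper: the abstract does not state the number of bootstrap rounds, the regularization strength, or the frequency cutoff, so the values of `lam`, `n_boot`, and `threshold` below are assumptions, as are all function names and the synthetic data. The lasso fit uses a plain coordinate-descent solver written with the standard library only.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def center(X, y):
    """Center every column of X and the response y (no intercept is fitted)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    return Xc, [v - y_mean for v in y]


def lasso_cd(X, y, lam, n_iter=50):
    """Cyclic coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column in this resample
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def bootstrap_selection_frequency(X, y, lam, n_boot=30, seed=0):
    """Fraction of bootstrap resamples in which each feature gets a nonzero coefficient."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        Xb, yb = center([X[i] for i in idx], [y[i] for i in idx])
        for j, b in enumerate(lasso_cd(Xb, yb, lam)):
            if b != 0.0:
                counts[j] += 1
    return [c / n_boot for c in counts]


def stable_features(freq, threshold=0.9):
    """Indices of features selected in at least `threshold` of the resamples (assumed cutoff)."""
    return [j for j, f in enumerate(freq) if f >= threshold]


def make_toy_data(n=60, p=6, seed=1):
    """Synthetic stand-in for expression data: only features 0 and 1 affect y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3.0 * row[0] - 2.0 * row[1] + 0.3 * rng.gauss(0, 1) for row in X]
    return X, y


X, y = make_toy_data()
freq = bootstrap_selection_frequency(X, y, lam=0.5)
print("selection frequency per feature:", freq)
print("stable features:", stable_features(freq))
```

The toy features here are independent, so the sketch does not reproduce the instability caused by correlated genes. It only shows how selection frequencies are tallied and thresholded. With real expression data, the regularization strength would normally be tuned, for example by cross-validation.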

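Funding Information

Supervising agencies for the research project: Ministry of Science, ICT and Future Planning; Ministry of Health and Welfare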
