http://dx.doi.org/10.5808/GI.2019.17.4.e41

Review of statistical methods for survival analysis using genomic data  

Lee, Seungyeoun (Department of Mathematics and Statistics, Sejong University)
Lim, Heeju (Department of Statistics, University of Connecticut)
Abstract
Survival analysis mainly deals with the time to an event, such as death, onset of disease, or bankruptcy. A common characteristic of survival data is the presence of "censored" observations, for which the time to event is not completely observed; the recorded censoring time provides only a lower bound for it. In other words, only the earlier of the event time and the censoring time is observed. Many traditional statistical methods have been used effectively for analyzing survival data with censored observations. However, with the development of high-throughput technologies for producing "omics" data, more advanced statistical methods, such as regularization, are required to construct predictive survival models from high-dimensional genomic data. Furthermore, machine learning approaches have been adapted to survival analysis to capture nonlinear effects and complex interactions between predictors and to predict individual survival probabilities more accurately. Because most clinicians and medical researchers can now easily access statistical software for analyzing survival data, a review of the statistical methods used in survival analysis is helpful. We review traditional survival methods and regularization methods with various penalty functions for the analysis of high-dimensional genomic data, and describe machine learning techniques that have been adapted to survival analysis.
Keywords
censoring; Cox model; Kaplan-Meier curve; machine learning; regularization; survival time
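
As a minimal illustration of the traditional and regularized methods summarized in the abstract, the sketch below assumes the Python lifelines package and its bundled Rossi recidivism dataset as a stand-in (the review itself is not tied to any particular software); in the genomic setting the covariates would instead be many gene-expression features.

# A minimal sketch, assuming the Python "lifelines" package and its example dataset.
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.datasets import load_rossi

data = load_rossi()  # columns: 'week' (follow-up time), 'arrest' (event indicator), covariates

# Kaplan-Meier curve: nonparametric estimate of the survival function S(t)
# that properly accounts for right-censored observations.
km = KaplanMeierFitter()
km.fit(durations=data["week"], event_observed=data["arrest"])
print(km.survival_function_.head())

# Cox proportional hazards model: semiparametric regression on the hazard rate.
cox = CoxPHFitter()
cox.fit(data, duration_col="week", event_col="arrest")
cox.print_summary()

# Regularized Cox model: an L1 (lasso-type) penalty shrinks coefficients toward zero,
# the kind of penalization used when the number of genomic predictors greatly exceeds
# the sample size; the penalty strength 0.1 is arbitrary for this sketch.
cox_lasso = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)
cox_lasso.fit(data, duration_col="week", event_col="arrest")
print(cox_lasso.params_)

In a real genomic application the penalty weight would be chosen by cross-validation, for example by maximizing the cross-validated partial likelihood or the concordance index.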