Multiple Group Testing Procedures for Analysis of High-Dimensional Genomic Data

Ko, Hyoseok; Kim, Kipoong; Sun, Hokeun

doi:10.5808/GI.2016.14.4.187

Genomics & Informatics

Volume 14, Issue 4 / Pages 187-195 / 2016 / pISSN 1598-866X / eISSN 2234-0742

Korea Genome Organization

Ko, Hyoseok (Department of Statistics, Pusan National University) ;
Kim, Kipoong (Department of Statistics, Pusan National University) ;
Sun, Hokeun (Department of Statistics, Pusan National University)

Received: 2016.07.27
Reviewed: 2016.10.26
Published: 2016.12.31

Abstract

In genetic association studies with high-dimensional genomic data, multiple group testing procedures are often required to identify disease- or trait-related genes or genetic regions, since multiple genetic sites or variants are located within the same gene or genetic region. However, statistical testing procedures based on individual tests suffer from multiple testing issues, such as control of the family-wise error rate and dependence among tests. Moreover, the main interest in genetic association studies is detecting only a few genes associated with a phenotype outcome among tens of thousands of genes. For this reason, regularization procedures, in which a phenotype outcome is regressed on all genomic markers and the regression coefficients are estimated from a penalized likelihood, have been considered a good alternative approach to the analysis of high-dimensional genomic data. However, the selection performance of regularization procedures has rarely been compared with that of statistical group testing procedures. In this article, we performed extensive simulation studies in which commonly used group testing procedures, such as principal component analysis, Hotelling's $T^2$ test, and the permutation test, are compared with the group lasso (least absolute shrinkage and selection operator) in terms of true positive selection. We also applied all methods considered in the simulation studies to identify genes associated with ovarian cancer from over 20,000 genetic sites generated from the Illumina Infinium HumanMethylation27K BeadChip. We found a large discrepancy in the selected genes between the multiple group testing procedures and the group lasso.
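
To make the idea of a gene-level group test concrete, the following is a minimal sketch of a two-sample Hotelling's $T^2$ test applied gene by gene, followed by a Bonferroni cut-off for family-wise error control. It is not the authors' simulation design: the function names, sample sizes, effect size, and the assumption of independent, normally distributed sites are all hypothetical choices made for illustration only.

```python
import numpy as np
from scipy import stats


def hotelling_t2_pvalue(x_case, x_ctrl):
    """Two-sample Hotelling's T^2 p-value for one group of sites.

    Rows are samples and columns are the sites belonging to one gene.
    """
    n1, p = x_case.shape
    n2 = x_ctrl.shape[0]
    diff = x_case.mean(axis=0) - x_ctrl.mean(axis=0)
    # Pooled within-group covariance matrix (p x p).
    s1 = np.atleast_2d(np.cov(x_case, rowvar=False))
    s2 = np.atleast_2d(np.cov(x_ctrl, rowvar=False))
    pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(pooled, diff)
    # Under normality, a rescaled T^2 follows an F(p, n1 + n2 - p - 1) law.
    dof2 = n1 + n2 - p - 1
    f_stat = dof2 / (p * (n1 + n2 - 2)) * t2
    return stats.f.sf(f_stat, p, dof2)


def simulate_gene_pvalues(n_per_arm=50, n_genes=20, sites_per_gene=5,
                          n_causal=2, shift=1.0, seed=0):
    """Hypothetical toy data: the first n_causal genes are mean-shifted in cases."""
    rng = np.random.default_rng(seed)
    n_sites = n_genes * sites_per_gene
    cases = rng.normal(size=(n_per_arm, n_sites))
    ctrls = rng.normal(size=(n_per_arm, n_sites))
    cases[:, : n_causal * sites_per_gene] += shift
    pvals = np.empty(n_genes)
    for g in range(n_genes):
        cols = slice(g * sites_per_gene, (g + 1) * sites_per_gene)
        pvals[g] = hotelling_t2_pvalue(cases[:, cols], ctrls[:, cols])
    return pvals


pvals = simulate_gene_pvalues()
# Bonferroni correction: one test per gene, family-wise level 0.05.
selected = np.flatnonzero(pvals < 0.05 / pvals.size)
print(selected)
```

Testing one statistic per gene reduces the number of tests from the number of sites to the number of genes, which is the motivation for group testing described above. Real methylation sites within a gene are typically correlated, so the independence assumed in this toy simulation would need to be replaced by a realistic covariance structure.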

