Comparison of methods for the proportion of true null hypotheses in microarray studies

  • Kang, Joonsung (Department of Information Statistics, Gangneung-Wonju National University)
  • Received : 2019.10.18
  • Accepted : 2019.11.28
  • Published : 2020.01.31

Abstract

We consider estimating the proportion of true null hypotheses in multiple testing problems. The family-wise error rate, a traditional error measure in multiple testing, is too conservative for controlling type I error in large-scale multiple testing setups; in contrast, the false discovery rate (FDR) has received significant attention in many research areas such as GWAS data, fMRI data, and signal processing. Identifying differentially expressed genes in microarray studies involves estimating the proportion of true null hypotheses in FDR procedures. Because the genuine dependence structure among genes in microarray data is unknown, this estimation must account for unknown dependence. We compare various procedures on simulated data and on real microarray data, using a hidden Markov model to generate simulated data with dependence. The procedure of Jin and Cai (2007) and the sliding linear model procedure of Wang et al. (2011) have relatively small bias and standard errors, making them better suited to estimating the proportion of true null hypotheses in simulated data under various setups. In the real data analysis, 5 of the 9 estimation procedures yield nearly the same estimated proportion of true null hypotheses in the microarray data.
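To make the quantity being estimated concrete, the following is a minimal sketch of one simple estimator of the proportion of true null hypotheses, the threshold-based estimator associated with Storey (2002): the number of p-values above a cutoff lambda, divided by m(1 - lambda). It is shown only as an illustration and is not necessarily one of the nine procedures compared in the paper. The function name `storey_pi0`, the default `lam=0.5`, and the example p-values are assumptions made for this sketch.

```python
def storey_pi0(pvalues, lam=0.5):
    """Threshold-based estimate of the proportion of true nulls.

    Under the null, p-values are uniform on [0, 1], so roughly
    pi0 * m * (1 - lam) of the m p-values should exceed lam.
    Solving for pi0 gives the estimate below.
    """
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    m = len(pvalues)
    if m == 0:
        raise ValueError("pvalues must be non-empty")
    # Count p-values above the cutoff; these are assumed to be mostly nulls.
    n_above = sum(1 for p in pvalues if p > lam)
    # Cap at 1 because a proportion cannot exceed 1.
    return min(1.0, n_above / (m * (1.0 - lam)))


# Hypothetical p-values, for illustration only.
example_pvalues = [0.001, 0.002, 0.01, 0.03, 0.2, 0.6, 0.75, 0.9]
estimate = storey_pi0(example_pvalues)
print(estimate)
```

This estimator assumes null p-values are uniformly distributed and does not model dependence among tests, which is the difficulty the abstract highlights for microarray data.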

References

  1. Baldi P and Hatfield W (2002). DNA Microarrays and Gene Expression, Cambridge University Press, Cambridge.
  2. Benjamini Y and Hochberg Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society: Series B, 57, 289-300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x
  3. Benjamini Y and Hochberg Y (2000). On the adaptive control of the false discovery rate in multiple testing with independent statistics, Journal of Educational and Behavioral Statistics, 25, 60-83. https://doi.org/10.3102/10769986025001060
  4. Benjamini Y and Yekutieli D (2001). The control of the false discovery rate in multiple testing under dependency, Annals of Statistics, 29, 1165-1188. https://doi.org/10.1214/aos/1013699998
  5. Churchill G (1992). Hidden Markov chains and the analysis of genome structure, Computers and Chemistry, 16, 107-115. https://doi.org/10.1016/0097-8485(92)80037-Z
  6. Ephraim Y and Merhav N (2002). Hidden Markov processes, IEEE Transactions on Information Theory, 48, 1518-1569. https://doi.org/10.1109/TIT.2002.1003838
  7. Jiang H and Doerge RW (2008). Estimating the proportion of true null hypotheses for multiple comparisons, Cancer Informatics, 6, 25-32.
  8. Jin J and Cai TT (2007). Estimating the null and the proportion of non-null effects in large-scale multiple comparisons, Journal of the American Statistical Association, 102, 495-506. https://doi.org/10.1198/016214507000000167
  9. Krogh A, Brown M, Mian I, Sjolander K, and Haussler D (1994). Hidden Markov models in computational biology. Applications to protein modeling, Journal of Molecular Biology, 235, 1501-1531. https://doi.org/10.1006/jmbi.1994.1104
  10. Langaas M, Lindqvist BH, and Ferkingstad E (2005). Estimating the proportion of true null hypotheses, with application to DNA microarray data, Journal of the Royal Statistical Society: Series B, 67, 555-572. https://doi.org/10.1111/j.1467-9868.2005.00515.x
  11. Nettleton D, Hwang JTG, Caldo RA, and Wise RP (2006). Estimating the number of true null hypotheses from a histogram of p values, Journal of Agricultural, Biological, and Environmental Statistics, 11, 337-356. https://doi.org/10.1198/108571106X129135
  12. Pounds S and Cheng C (2006). Robust estimation of the false discovery rate, Bioinformatics, 22, 1979-1987. https://doi.org/10.1093/bioinformatics/btl328
  13. Rabiner L (1989). A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, 77, 257-286. https://doi.org/10.1109/5.18626
  14. Speed T (2003). Statistical Analysis of Gene Expression Microarray Data, Chapman and Hall/CRC, New York.
  15. Storey JD (2002). A direct approach to false discovery rates, Journal of the Royal Statistical Society: Series B, 64, 479-498. https://doi.org/10.1111/1467-9868.00346
  16. Storey JD, Taylor JE, and Siegmund D (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach, Journal of the Royal Statistical Society: Series B, 66, 187-205. https://doi.org/10.1111/j.1467-9868.2004.00439.x
  17. Storey JD and Tibshirani R (2003). Statistical significance for genomewide studies, Proceedings of the National Academy of Sciences, 100, 9440-9445. https://doi.org/10.1073/pnas.1530509100
  18. Sun W and Cai TT (2009). Large-scale multiple testing under dependence, Journal of the Royal Statistical Society: Series B, 71, 393-424. https://doi.org/10.1111/j.1467-9868.2008.00694.x
  19. Van't Wout AB, Lehrman GK, Mikheeva SA, O'Keeffe GC, Katze MG, Bumgarner RE, Geiss GK, and Mullins JI (2003). Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4(+)-T-cell lines, Journal of Virology, 77, 1392-1402. https://doi.org/10.1128/JVI.77.2.1392-1402.2003
  20. Wang HQ, Tuominen LK, and Tsai CJ (2011). SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures, Bioinformatics, 27, 225-231. https://doi.org/10.1093/bioinformatics/btq650