Estimation of high-dimensional sparse cross correlation matrix

  • Yin, Cao (Department of Statistics, Seoul National University) ;
  • Kwangok, Seo (Department of Statistics, Seoul National University) ;
  • Soohyun, Ahn (Department of Mathematics, Ajou University) ;
  • Johan, Lim (Department of Statistics, Seoul National University)
  • Received : 2022.04.01
  • Accepted : 2022.07.28
  • Published : 2022.11.30

Abstract

Motivated by an integrative study of multi-omics data, we are interested in estimating the structure of the sparse cross correlation matrix of two high-dimensional random vectors. We recast the problem as a multiple testing problem and propose a new method to estimate the sparse structure of the cross correlation matrix: we test the correlation coefficients simultaneously and threshold them while controlling the false discovery rate (FDR) at a predetermined level α. We then apply the proposed method and an alternative adaptive thresholding procedure by Cai and Liu (2016) to the integrative analysis of the protein expression data (X) and the mRNA expression data (Y) in the TCGA breast cancer cohort. By varying the FDR level α, we show that the new procedure is consistently more efficient than the alternative in estimating the sparse structure of the cross correlation matrix.
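The general idea of thresholding a cross correlation matrix through multiple testing can be illustrated with a minimal sketch. This is not the procedure proposed in the paper, whose test statistic is not given in the abstract. As stand-ins, the sketch assumes Fisher z-transformed sample correlations (approximately standard normal under zero correlation for bivariate normal data) and the Benjamini-Hochberg step-up rule (Benjamini and Hochberg, 1995). The function names and the default α are hypothetical.

```python
import math
import numpy as np


def cross_correlation(X, Y):
    """Sample cross correlation matrix (p x q) of X (n x p) and Y (n x q)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Xs.T @ Ys / X.shape[0]


def fdr_threshold_cross_correlation(X, Y, alpha=0.05):
    """Zero out entries of the sample cross correlation matrix that are not
    rejected by a Benjamini-Hochberg test of H0: rho_ij = 0.

    Assumption (not from the paper): Fisher z-statistics with a normal
    reference distribution, which is only approximate.
    """
    n = X.shape[0]
    R = cross_correlation(X, Y)

    # Fisher z-transform; clip to keep arctanh finite when |r| is near 1.
    z = np.sqrt(n - 3) * np.arctanh(np.clip(R, -0.999999, 0.999999))

    # Two-sided normal p-values: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    pvals = np.vectorize(math.erfc)(np.abs(z) / math.sqrt(2))

    # Benjamini-Hochberg step-up: reject the k smallest p-values, where k is
    # the largest index with p_(k) <= alpha * k / m.
    flat = pvals.ravel()
    m = flat.size
    order = np.argsort(flat)
    passed = flat[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])
        reject[order[: k + 1]] = True

    support = reject.reshape(R.shape)
    return np.where(support, R, 0.0), support
```

Calling `fdr_threshold_cross_correlation(X, Y, alpha=0.1)` on two data matrices with the same number of rows returns the thresholded estimate and a Boolean matrix marking the estimated support. Varying `alpha` changes how many entries survive.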

Acknowledgement

This research was supported by the National Research Foundation of Korea (NRF-2019R1F1A1056779).

References

  1. Benjamini Y and Hochberg Y (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B, 57, 289-300.
  2. Benjamini Y and Yekutieli D (2001). The control of the false discovery rate in multiple testing under dependency, The Annals of Statistics, 29, 1165-1188. https://doi.org/10.1214/aos/1013699998
  3. Bennett CM, Wolford GL, and Miller MB (2009). The principled control of false positives in neuroimaging, Social Cognitive and Affective Neuroscience, 4, 417-422. https://doi.org/10.1093/scan/nsp053
  4. Bickel P and Levina E (2008). Covariance regularization by thresholding, The Annals of Statistics, 36, 2577-2604. https://doi.org/10.1214/08-AOS600
  5. Cai T and Liu W (2011). Adaptive thresholding for sparse covariance matrix estimation, Journal of the American Statistical Association, 106, 672-684. https://doi.org/10.1198/jasa.2011.tm10560
  6. Cai T and Liu W (2016). Large-scale multiple testing of correlations, Journal of the American Statistical Association, 111, 229-240. https://doi.org/10.1080/01621459.2014.999157
  7. Cheng J, Kapranov P, Drenkow J, and Dike S (2005). Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution, Science, 308, 1149-1154. https://doi.org/10.1126/science.1108625
  8. Dubois PC, Trynka G, Franke L, et al. (2010). Multiple common variants for celiac disease influencing immune gene expression, Nature Genetics, 42, 295-302. https://doi.org/10.1038/ng.543
  9. Efron B and Tibshirani R (2002). Empirical Bayes methods and false discovery rates for microarrays, Genetic Epidemiology, 23, 70-86. https://doi.org/10.1002/gepi.1124
  10. Efron B (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis, Journal of the American Statistical Association, 99, 96-104. https://doi.org/10.1198/016214504000000089
  11. Elliott P and Wartenberg D (2004). Review spatial epidemiology: Current approaches and future challenges, Environmental Health Perspectives, 112, 998-1006. https://doi.org/10.1289/ehp.6735
  12. Fan J, Fan Y, and Lv J (2008). High dimensional covariance matrix estimation using a factor model, Journal of Econometrics, 147, 186-197. https://doi.org/10.1016/j.jeconom.2008.09.017
  13. Fan J, Han X, and Gu W (2012). Estimating false discovery proportion under arbitrary covariance dependence, Journal of the American Statistical Association, 107, 1019-1035. https://doi.org/10.1080/01621459.2012.720478
  14. Han H, Shim H, Shin D, et al. (2015). TRRUST: A reference database of human transcriptional regulatory interactions, Scientific Reports, 5, 11432.
  15. Huttlin EL, Ting L, Bruckner RJ, et al. (2015). The BioPlex network: A systematic exploration of the human interactome, Cell, 162, 425-440. https://doi.org/10.1016/j.cell.2015.06.043
  16. Jaeger J, Sengupta R, and Ruzzo WL (2003). Improved gene selection for classification of microarrays, Pacific Symposium on Biocomputing, 8, 53-64.
  17. Liu W (2013). Gaussian graphical model estimation with false discovery rate control, The Annals of Statistics, 41, 2948-2978.
  18. Razick S, Magklaras G, and Donaldson IM (2008). IRefIndex: A consolidated protein interaction database with provenance, BMC Bioinformatics, 9, 405.
  19. Rosato A, Tenori L, Cascante M, De Atauri Carulla PR, Martins Dos Santos VA, and Saccenti E (2018). From correlation to causation: Analysis of metabolomics data using systems biology approaches, Metabolomics, 14, 37.
  20. Shaw P, Greenstein D, Lerch J, et al. (2006). Intellectual ability and cortical development in children and adolescents, Nature, 440, 676-679. https://doi.org/10.1038/nature04513
  21. Shedden K and Taylor J (2005). Differential correlation detects complex associations between gene expression and clinical outcomes in lung adenocarcinomas, Methods of Microarray Data Analysis, (pp. 121-131), Springer, Boston.
  22. Storey JD (2002). A direct approach to false discovery rates, Journal of the Royal Statistical Society, Series B, 64, 479-498. https://doi.org/10.1111/1467-9868.00346
  23. Wang W and Fan J (2017). Asymptotics of empirical eigenstructure for high dimensional spiked covariance, The Annals of Statistics, 45, 1342-1374.
  24. Xia Y, Cai T, and Cai TT (2015). Testing differential networks with applications to detecting gene-by-gene interactions, Biometrika, 102, 247-266. https://doi.org/10.1093/biomet/asu074
  25. Yu D, Lee SH, Lim J, Xiao G, Craddock RC, and Biswal BB (2018). Fused lasso regression for identifying differential correlations in brain connectome graphs, Statistical Analysis and Data Mining, 11, 203-226. https://doi.org/10.1002/sam.11382
  26. Zhao F, Xuan Z, Liu L, and Zhang MQ (2005). TRED: A Transcriptional Regulatory Element Database and a platform for in silico gene regulation studies, Nucleic Acids Research, 33, D103-D107. https://doi.org/10.1093/nar/gki004
  27. Zheng G, Tu K, Yang Q, Xiong Y, Wei C, Xie L, Zhu Y, and Li Y (2008). ITFP: An integrated platform of mammalian transcription factors, Bioinformatics, 24, 2416-2417. https://doi.org/10.1093/bioinformatics/btn439