Estimation of high-dimensional sparse cross correlation matrix

Yin, Cao; Kwangok, Seo; Soohyun, Ahn; Johan, Lim

doi:10.29220/CSAM.2022.29.6.655

Communications for Statistical Applications and Methods

Volume 29 Issue 6 / Pages 655-664 / 2022 / 2287-7843 (pISSN) / 2383-4757 (eISSN)

The Korean Statistical Society (한국통계학회)

Yin, Cao (Department of Statistics, Seoul National University) ;
Kwangok, Seo (Department of Statistics, Seoul National University) ;
Soohyun, Ahn (Department of Mathematics, Ajou University) ;
Johan, Lim (Department of Statistics, Seoul National University)

Received : 2022.04.01
Accepted : 2022.07.28
Published : 2022.11.30

https://doi.org/10.29220/CSAM.2022.29.6.655

Abstract

Motivated by an integrative study of multi-omics data, we are interested in estimating the structure of the sparse cross correlation matrix of two high-dimensional random vectors. We recast the problem as a multiple testing problem and propose a new method to estimate the sparse structure of the cross correlation matrix. To do so, we test the correlation coefficients simultaneously and threshold them by controlling the false discovery rate (FDR) at a predetermined level α. Further, we apply the proposed method and an alternative adaptive thresholding procedure by Cai and Liu (2016) to the integrative analysis of the protein expression data (X) and the mRNA expression data (Y) in the TCGA breast cancer cohort. By varying the FDR level α, we show that the new procedure is consistently more efficient in estimating the sparse structure of the cross correlation matrix than the alternative.
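The general idea of thresholding a cross correlation matrix through simultaneous testing can be illustrated with a minimal sketch. This is not the procedure proposed in the paper, whose test statistics and FDR estimate are not given in the abstract. It assumes instead a Fisher z-transform test for each coefficient and the Benjamini and Hochberg (1995) step-up rule. The function names `cross_correlation` and `fdr_threshold` and the simulated data are hypothetical and serve only as illustration.

```python
import math
import numpy as np

def cross_correlation(X, Y):
    """Sample cross correlation matrix (p x q) of X (n x p) and Y (n x q)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Xs.T @ Ys / X.shape[0]

def fdr_threshold(R, n, alpha=0.05):
    """Keep only entries of R rejected by the Benjamini-Hochberg rule at level alpha.

    Assumption: each H0 (rho_ij = 0) is tested with the Fisher z-transform,
    a stand-in for the paper's own test statistic.
    """
    # Under H0, sqrt(n - 3) * atanh(r) is approximately standard normal.
    z = np.sqrt(n - 3) * np.arctanh(np.clip(R, -0.999999, 0.999999))
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    pvals = np.vectorize(math.erfc)(np.abs(z) / math.sqrt(2.0)).ravel()
    m = pvals.size
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    keep = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank meeting the BH condition
        keep[order[:k + 1]] = True        # reject all hypotheses up to that rank
    keep = keep.reshape(R.shape)
    return np.where(keep, R, 0.0), keep

# Simulated data (illustrative only): independent noise plus one planted association.
rng = np.random.default_rng(0)
n, p, q = 200, 30, 40
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))
Y[:, 0] = X[:, 0] + 0.3 * rng.standard_normal(n)

R = cross_correlation(X, Y)
R_sparse, support = fdr_threshold(R, n, alpha=0.05)
```

Here `support` marks the entries declared nonzero, and `R_sparse` is the thresholded estimate. Raising `alpha` admits more entries at the cost of a larger expected proportion of false discoveries.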

Keywords

Acknowledgement

This research was supported by the National Research Foundation of Korea (NRF-2019R1F1A1056779).

References

Benjamini Y and Hochberg Y (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B, 57, 289-300.
Benjamini Y and Yekutieli D (2001). The control of the false discovery rate in multiple testing under dependency, The Annals of Statistics, 29, 1165-1188. https://doi.org/10.1214/aos/1013699998
Bennett CM, Wolford GL, and Miller MB (2009). The principled control of false positives in neuroimaging, Social Cognitive and Affective Neuroscience, 4, 417-422. https://doi.org/10.1093/scan/nsp053
Bickel P and Levina E (2008). Covariance regularization by thresholding, The Annals of Statistics, 36, 2577-2604. https://doi.org/10.1214/08-AOS600
Cai T and Liu W (2011). Adaptive thresholding for sparse covariance matrix estimation, Journal of the American Statistical Association, 106, 672-684. https://doi.org/10.1198/jasa.2011.tm10560
Cai T and Liu W (2016). Large-scale multiple testing of correlations, Journal of the American Statistical Association, 111, 229-240. https://doi.org/10.1080/01621459.2014.999157
Cheng J, Kapranov P, Drenkow J, and Dike S (2005). Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution, Science, 308, 1149-1154. https://doi.org/10.1126/science.1108625
Dubois PC, Trynka G, Franke L, et al. (2010). Multiple common variants for celiac disease influencing immune gene expression, Nature Genetics, 42, 295-302. https://doi.org/10.1038/ng.543
Efron B and Tibshirani R (2002). Empirical Bayes methods and false discovery rates for microarrays, Genetic Epidemiology, 23, 70-86. https://doi.org/10.1002/gepi.1124
Efron B (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis, Journal of the American Statistical Association, 99, 96-104. https://doi.org/10.1198/016214504000000089
Elliott P and Wartenberg D (2004). Review spatial epidemiology: Current approaches and future challenges, Environmental Health Perspectives, 112, 998-1006. https://doi.org/10.1289/ehp.6735
Fan J, Fan Y, and Lv J (2008). High dimensional covariance matrix estimation using a factor model, Journal of Econometrics, 147, 186-197. https://doi.org/10.1016/j.jeconom.2008.09.017
Fan J, Han X, and Gu W (2012). Estimating false discovery proportion under arbitrary covariance dependence, Journal of the American Statistical Association, 107, 1019-1035. https://doi.org/10.1080/01621459.2012.720478
Han H, Shim H, Shin D, et al. (2015). TRRUST: A reference database of human transcriptional regulatory interactions, Scientific Reports, 5, 11432.
Huttlin EL, Ting L, Bruckner RJ, et al. (2015). The bioplex network: A systematic exploration of the human interactome, Cell, 162, 425-440. https://doi.org/10.1016/j.cell.2015.06.043
Jaeger J, Sengupta R, and Ruzzo WL (2003). Improved gene selection for classification of microarrays, Pacific Symposium on Biocomputing, 8, 53-64.
Liu W (2013). Gaussian graphical model estimation with false discovery rate control, The Annals of Statistics, 41, 2948-2978.
Razick S, Magklaras G, and Donaldson IM (2008). IRefIndex: A consolidated protein interaction database with provenance, BMC Bioinformatics, 9, 405.
Rosato A, Tenori L, Cascante M, De Atauri Carulla PR, Martins Dos Santos VA, and Saccenti E (2018). From correlation to causation: Analysis of metabolomics data using systems biology approaches, Metabolomics, 14, 37.
Shaw P, Greenstein D, Lerch J, et al. (2006). Intellectual ability and cortical development in children and adolescents, Nature, 440, 676-679. https://doi.org/10.1038/nature04513
Shedden K and Taylor J (2005). Differential correlation detects complex associations between gene expression and clinical outcomes in lung adenocarcinomas, Methods of Microarray Data Analysis, (pp. 121-131), Springer, Boston.
Storey JD (2002). A direct approach to false discovery rates, Journal of the Royal Statistical Society, Series B, 64, 479-498. https://doi.org/10.1111/1467-9868.00346
Wang W and Fan J (2017). Asymptotics of empirical eigenstructure for high dimensional spiked covariance, The Annals of Statistics, 45, 1342-1374.
Xia Y, Cai T, and Cai TT (2015). Testing differential networks with applications to detecting gene-by-gene interactions, Biometrika, 102, 247-266. https://doi.org/10.1093/biomet/asu074
Yu D, Lee SH, Lim J, Xiao G, Craddock RC, and Biswal BB (2018). Fused lasso regression for identifying differential correlations in brain connectome graphs, Statistical Analysis and Data Mining, 11, 203-226. https://doi.org/10.1002/sam.11382
Zhao F, Xuan Z, Liu L, and Zhang MQ (2005). TRED: A Transcriptional Regulatory Element Database and a platform for in silico gene regulation studies, Nucleic Acids Research, 33, D103-D107. https://doi.org/10.1093/nar/gki004
Zheng G, Tu K, Yang Q, Xiong Y, Wei C, Xie L, Zhu Y, and Li Y (2008). ITFP: An integrated platform of mammalian transcription factors, Bioinformatics, 24, 2416-2417. https://doi.org/10.1093/bioinformatics/btn439
