http://dx.doi.org/10.5808/gi.22074

Overview of frequent pattern mining  

Jurg Ott (Laboratory of Statistical Genetics, Rockefeller University)
Taesung Park (Department of Statistics, Seoul National University)
Abstract
Various methods of frequent pattern mining have been applied to genetic problems, specifically to the combined association with disease of two genotypes at different DNA variants (a genotype pattern, or diplotype). These methods can select genotype patterns that are more common in affected than in unaffected individuals. Assessing the statistical significance of the selected patterns poses some unique problems, which are briefly outlined here.
Keywords
data mining; genotype pattern; machine learning; pattern recognition; statistical significance