Genetic classification of various familial relationships using the stacking ensemble machine learning approaches

  • Su Jin Jeong (Statistics Support Part, Medical Science Research Institute, Kyung Hee University Medical Center)
  • Hyo-Jung Lee (Product Development HQ, Dong-A ST)
  • Soong Deok Lee (Department of Forensic Medicine, College of Medicine, Seoul National University)
  • Ji Eun Park (Department of Statistics, Korea University)
  • Jae Won Lee (Department of Statistics, Korea University)
  • Received : 2023.05.24
  • Accepted : 2024.01.12
  • Published : 2024.05.31

Abstract

Familial searching is a useful technique in forensic investigation. Genetic information makes it possible to identify individuals, determine familial relationships, and obtain racial/ethnic information. The total number of shared alleles (TNSA) and likelihood ratio (LR) methods have traditionally been used for this purpose, and data-mining classification methods have recently been applied as well. However, these methods are difficult to apply to relationships of the third degree or beyond (e.g., uncle-nephew and first cousins). We therefore propose a stacking ensemble machine learning algorithm to improve the accuracy of familial relationship identification. In an analysis of real data, meta-classifiers built with a stacking algorithm identified relationships more accurately than the traditional TNSA and LR methods and the data mining techniques.
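The abstract names TNSA without defining how it is computed. As a rough illustration only, the sketch below assumes the common reading of TNSA: at each short tandem repeat (STR) locus, count the alleles two profiles share identical-by-state (0, 1, or 2), then sum over the loci typed in both. The helper names `shared_alleles` and `tnsa`, the loci chosen, and all allele values are hypothetical and are not taken from the paper.

```python
from collections import Counter


def shared_alleles(genotype_a, genotype_b):
    """Count alleles shared identical-by-state at one locus (0, 1, or 2)."""
    # Counter & Counter is a multiset intersection, so a homozygous
    # genotype (15, 15) shares only one allele with (15, 16).
    overlap = Counter(genotype_a) & Counter(genotype_b)
    return sum(overlap.values())


def tnsa(profile_a, profile_b):
    """Sum shared-allele counts over the loci present in both profiles."""
    common_loci = profile_a.keys() & profile_b.keys()
    return sum(
        shared_alleles(profile_a[locus], profile_b[locus])
        for locus in common_loci
    )


# Made-up profiles: locus name -> pair of allele repeat counts (assumption).
person_1 = {"D3S1358": (15, 16), "vWA": (17, 17), "FGA": (22, 24)}
person_2 = {"D3S1358": (15, 15), "vWA": (17, 17), "FGA": (21, 23)}

print(tnsa(person_1, person_2))  # 1 + 2 + 0 = 3
```

A score of this kind is one plausible input feature for a classifier; how the authors actually construct features for the stacking ensemble is not stated in the abstract.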

Acknowledgement

This research was supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00208882) and by the Korean National Police Agency [Project Name: Advancing the Appraisal Techniques of Forensic Entomology / Project Number: PR10-04-000-22].
