A Study on Predicting Lung Cancer Using RNA-Sequencing Data with Ensemble Learning

  • Geon AN (Department of Medical IT, Eulji University)
  • JooYong PARK (Department of Big Data Medical Convergence, Eulji University)
  • Received : 2024.05.20
  • Accepted : 2024.06.14
  • Published : 2024.06.30

Abstract

In this paper, we explore the use of RNA-sequencing (RNA-seq) data and ensemble machine learning to predict lung cancer, a leading cause of cancer mortality worldwide, and to inform its treatment strategies. The research applies Random Forest, XGBoost, and LightGBM models to gene expression profiles from extensive datasets, aiming to improve predictive accuracy for lung cancer prognosis. The methodology focuses on preprocessing the RNA-seq data to standardize expression levels across samples and on applying ensemble algorithms to maximize prediction stability and reduce overfitting. Key findings indicate that the ensemble models, especially XGBoost, substantially outperform traditional predictive models. Significant genetic markers, such as ADGRF5, are identified as crucial for predicting lung cancer outcomes. In conclusion, ensemble learning on RNA-seq data proves highly effective in predicting lung cancer, suggesting a potential shift towards more precise and personalized treatment approaches. The results support further integration of molecular and clinical data to refine diagnostic models and improve clinical outcomes, and they underscore the role of advanced molecular diagnostics in improving patient survival and quality of life. This study lays the groundwork for future research on applying RNA-seq data and ensemble machine learning in clinical settings.
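The preprocessing step mentioned in the abstract, standardizing expression levels across samples, can be illustrated with a minimal sketch. The abstract does not say which transformation was used, so the log2(x + 1) transform followed by a per-gene z-score below is an assumption, and the function name, data layout, and toy values are hypothetical rather than taken from the study.

```python
import math
from statistics import mean, pstdev


def standardize_expression(counts):
    """Log-transform and z-score each gene across samples.

    `counts` maps a gene name to a list of non-negative expression
    values, one per sample (hypothetical layout, not the paper's).
    """
    standardized = {}
    for gene, values in counts.items():
        # log2(x + 1) compresses the wide dynamic range of expression values
        logged = [math.log2(v + 1.0) for v in values]
        mu = mean(logged)
        sigma = pstdev(logged)  # population standard deviation
        if sigma == 0:
            # A gene with constant expression carries no signal; map it to zeros
            standardized[gene] = [0.0 for _ in logged]
        else:
            standardized[gene] = [(x - mu) / sigma for x in logged]
    return standardized


# Toy values for illustration only; they are not data from the study.
toy_counts = {"GENE_A": [0, 3, 15, 63], "GENE_B": [7, 7, 7, 7]}
result = standardize_expression(toy_counts)
```

After this step each gene has zero mean and unit variance across samples, which puts features on a comparable scale before they are passed to a classifier. Tree-based ensembles such as Random Forest, XGBoost, and LightGBM are largely insensitive to monotonic rescaling, so the step matters mainly for comparing against scale-sensitive baseline models.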
