Machine Learning-based Phishing Website Detection Model

  • Sumin Oh (Department of Data Science, Seoul Women's University) ;
  • Minseo Park (Department of Data Science, Seoul Women's University)
  • Received : 2024.04.18
  • Accepted : 2024.06.10
  • Published : 2024.07.31

Abstract

As social media has become widespread, phishing attacks have grown more sophisticated, and defending against them requires determining whether a website a user is about to visit is legitimate or phishing. We propose a machine learning-based classification model that predicts this status. First, we collect URL information, convert it into numerical data, and remove outliers. Second, we apply the variance inflation factor (VIF) to assess the correlation and independence among the variables. Third, we build a phishing website detection model from machine learning classifiers and use it to predict website status. Among the classifiers, Random Forest performed best, achieving a precision of 93.74%, a recall of 92.26%, and an accuracy of 93.14% on the test data. We expect this work to be applicable to detecting a wide range of phishing crimes.
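
The steps named in the abstract can be illustrated with a minimal sketch that uses only the Python standard library. It is not the authors' implementation. The feature set in `url_features` is hypothetical, since the abstract does not list the variables used. The VIF is computed with the standard identity that the VIF of variable j equals the j-th diagonal entry of the inverse correlation matrix. The metrics follow the usual confusion-matrix definitions, with phishing assumed to be the positive class (label 1). Outlier removal and the Random Forest classifier are omitted because the abstract gives no details on the former and the latter would need a third-party library.

```python
import math
from urllib.parse import urlparse


def url_features(url):
    """Convert a URL string into numeric features (hypothetical feature set)."""
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(ch.isdigit() for ch in url),
        "uses_https": 1 if parsed.scheme == "https" else 0,
    }


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long numeric columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]  # zero here means perfectly collinear variables
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each variable: the diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


def classification_metrics(y_true, y_pred):
    """Precision, recall, and accuracy with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / len(pairs),
    }
```

A variable with a large VIF is largely explained by the other variables. A common rule of thumb, not taken from the paper, treats values above roughly 5 to 10 as a sign of problematic multicollinearity.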

Acknowledgement

This paper was supported by a research grant from Seoul Women's University (2024-0112).
