URL Phishing Detection System Utilizing Catboost Machine Learning Approach

  • Fang, Lim Chian (Information Security Forensics and Computer Networking (INSFORNET), Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka (UTeM))
  • Ayop, Zakiah (Information Security Forensics and Computer Networking (INSFORNET), Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka (UTeM))
  • Anawar, Syarulnaziah (Information Security Forensics and Computer Networking (INSFORNET), Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka (UTeM))
  • Othman, Nur Fadzilah (Information Security Forensics and Computer Networking (INSFORNET), Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka (UTeM))
  • Harum, Norharyati (Information Security Forensics and Computer Networking (INSFORNET), Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka (UTeM))
  • Abdullah, Raihana Syahirah (Information Security Forensics and Computer Networking (INSFORNET), Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka (UTeM))
  • Received: 2021.09.05
  • Published: 2021.09.30

Abstract

The proliferation of phishing websites enables hackers to access confidential personal or financial data, thereby decreasing trust in e-business. This paper compares phishing detection techniques that rely on URL-based features. To analyze and compare the performance of supervised machine learning classifiers, the classifiers were trained on more than 11,005 phishing and legitimate URLs. Thirty features were extracted from each URL to classify it as phishing or legitimate. Logistic Regression, Random Forest, and CatBoost classifiers were then analyzed and their performance was evaluated. The results showed that CatBoost was a much better classifier than Random Forest and Logistic Regression, with a detection accuracy of up to 96%.
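To make the idea of URL-based features concrete, the following is a minimal sketch of lexical feature extraction using only the Python standard library. It is not the authors' 30-feature set: the function name `extract_url_features`, the seven features, and the numeric thresholds are assumptions chosen for illustration, loosely modeled on features commonly used in phishing datasets.

```python
import ipaddress
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Return a few illustrative lexical features (1 = suspicious, 0 = not).

    Hypothetical helper: the feature names and thresholds below are
    assumptions for illustration, not the feature set used in the paper.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # A raw IP address in place of a domain name is a common phishing sign.
    try:
        ipaddress.ip_address(host)
        has_ip = 1
    except ValueError:
        has_ip = 0

    return {
        "has_ip_address": has_ip,
        "long_url": int(len(url) > 75),              # assumed threshold
        "has_at_symbol": int("@" in url),            # '@' hides the real host
        "double_slash_in_path": int("//" in parsed.path),
        "hyphen_in_host": int("-" in host),
        "many_subdomains": int(host.count(".") > 3),  # assumed threshold
        "no_https": int(parsed.scheme != "https"),
    }


if __name__ == "__main__":
    for example in (
        "https://www.example.com/",
        "http://192.168.0.1/login",
        "http://secure-login.example.com@evil.test//redirect",
    ):
        print(example, extract_url_features(example))
```

A feature dictionary like this could be turned into one row of a numeric table, which is the kind of input that classifiers such as Logistic Regression, Random Forest, or CatBoost are trained on.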

Acknowledgement

This publication has been supported by the Center of Research and Innovation Management (CRIM), Universiti Teknikal Malaysia Melaka (UTeM). The authors would like to thank UTeM and the INSFORNET research group members for their support.
