[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.22937/IJCSNS.2021.21.9.39

URL Phishing Detection System Utilizing Catboost Machine Learning Approach

Fang, Lim Chian (Information Security Forensics and Computer Networking (INSFORNET), Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka (UTeM))
Ayop, Zakiah (Information Security Forensics and Computer Networking (INSFORNET), Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka (UTeM))
Anawar, Syarulnaziah (Information Security Forensics and Computer Networking (INSFORNET), Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka (UTeM))
Othman, Nur Fadzilah (Information Security Forensics and Computer Networking (INSFORNET), Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka (UTeM))
Harum, Norharyati (Information Security Forensics and Computer Networking (INSFORNET), Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka (UTeM))
Abdullah, Raihana Syahirah (Information Security Forensics and Computer Networking (INSFORNET), Fakulti Teknologi Maklumat dan Komunikasi, Universiti Teknikal Malaysia Melaka (UTeM))

Publication Information

International Journal of Computer Science & Network Security / v.21, no.9, 2021, pp. 297-302

Abstract

The development of various phishing websites enables hackers to access confidential personal or financial data, thus decreasing trust in e-business. This paper compares phishing detection techniques that use URL-based features. To analyze and compare the performance of supervised machine learning classifiers, the classifiers were trained on more than 11,005 phishing and legitimate URLs. Thirty features were extracted from the URLs to classify each one as phishing or legitimate. Logistic Regression, Random Forest, and CatBoost classifiers were then evaluated. The results showed that CatBoost was a much better classifier than Random Forest and Logistic Regression, with a detection accuracy of up to 96%.
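
The abstract does not list the 30 features, so the following is only a minimal sketch of what URL-based feature extraction could look like, using Python's standard library. The function names `host_is_ip` and `extract_url_features` and the seven feature names are hypothetical illustrations, not the paper's feature set.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip(host: str) -> bool:
    """Return True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_url_features(url: str) -> dict:
    """Map a URL to a few illustrative features (assumed, not from the paper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        # Phishing pages are sometimes served from a raw IP address.
        "uses_ip_host": int(host_is_ip(host)),
        # Total number of characters in the URL.
        "url_length": len(url),
        # An '@' can hide the real host behind a familiar-looking prefix.
        "has_at_symbol": int("@" in url),
        # Hyphenated hosts can imitate well-known brand names.
        "has_hyphen_in_host": int("-" in host),
        # Number of dots in the host, a rough proxy for subdomain depth.
        "num_host_dots": host.count("."),
        # Whether the scheme is HTTPS.
        "uses_https": int(parts.scheme == "https"),
        # A '//' inside the path may indicate an embedded redirect.
        "double_slash_in_path": int("//" in parts.path),
    }
```

Each resulting dictionary could be turned into one row of a feature matrix and passed to any of the classifiers named in the abstract; that training step is not shown here.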

Keywords

Phishing; URL; CatBoost; Logistic Regression; Random Forest
