Development of a Malicious URL Machine Learning Detection Model Reflecting the Main Feature of URLs

Kim, Youngjun; Lee, Jaewoo

doi:10.6109/jkiice.2022.26.12.1786

Journal of the Korea Institute of Information and Communication Engineering (한국정보통신학회논문지)

Volume 26, Issue 12, Pages 1786-1793, 2022 / 2234-4772 (pISSN) / 2288-4165 (eISSN)

The Korea Institute of Information and Communication Engineering (한국정보통신학회)


Kim, Youngjun (Department of Convergence Security, Chung-Ang University) ;
Lee, Jaewoo (Department of Industrial Security, Chung-Ang University)

김영준 ;
이재우

Received : 2022.10.21
Accepted : 2022.11.01
Published : 2022.12.31


Abstract

Cyber-attacks such as smishing and hacking emails that exploit COVID-19 and other political and social issues have continued in recent years. Most of these attacks lure victims into opening malicious URLs in order to steal personal information, and machine learning and deep learning techniques are being actively studied as a defense. However, the data set features used in previous studies were too simple to provide a sufficient basis for judging a URL to be malicious. In this paper, in addition to the lexical URL features reflected in previous research, nine main features in three categories derived from URL data analysis are proposed: "URL Days", "URL Word", and "URL Abnormal". Four machine learning algorithms were applied, and performance was measured by F1-Score and accuracy. Compared with an existing study, the results improved by 0.9% on average, and the highest F1-Score and accuracy reached 98.5%, indicating that the main features contribute to improved accuracy and performance.
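To make the terms in the abstract concrete, the following is a minimal Python sketch of what lexical URL features and the two reported metrics might look like. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define the nine proposed features or name the four algorithms, so the function names `lexical_features` and `accuracy_and_f1` and every feature listed here are hypothetical examples of commonly used lexical features.

```python
from urllib.parse import urlparse


def lexical_features(url: str) -> dict:
    """Return a few simple lexical features of a URL.

    These features are illustrative assumptions; they are not the
    nine main features proposed in the paper.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "digit_count": sum(ch.isdigit() for ch in url),
        "dot_count": host.count("."),
        "has_at_symbol": "@" in url,
    }


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and F1-score for binary labels.

    `positive` marks the malicious class (an assumed label convention).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there are no positives.
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


if __name__ == "__main__":
    print(lexical_features("http://192.168.0.1/login.php?id=1"))
    print(accuracy_and_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```

In a full pipeline, feature dictionaries like these would be converted to numeric vectors and passed to a classifier, and the two metrics would be computed on a held-out test set.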


