Topic Automatic Extraction Model based on Unstructured Security Intelligence Report

  • Hur, YunA (Department of Computer Science and Engineering, Korea University)
  • Lee, Chanhee (Department of Computer Science and Engineering, Korea University)
  • Kim, Gyeongmin (Department of Computer Science and Engineering, Korea University)
  • Lim, HeuiSeok (Department of Computer Science and Engineering, Korea University)
  • Received : 2019.04.24
  • Accepted : 2019.06.20
  • Published : 2019.06.28

Abstract

As cyber attack methods become more intelligent and diverse, incidents such as security breaches and international crimes are increasing. To predict and respond to these attacks, the characteristics, methods, and types of attack techniques must be identified. To this end, many security companies publish security intelligence reports so that attack patterns can be identified quickly and further damage prevented. However, the reports each company distributes follow no common format, and the number of unstructured reports keeps growing. In this paper, we propose a method to extract structured data from unstructured security intelligence reports. We also propose an automatic topic extraction model for security intelligence reports that divides a large volume of reports into sub-groups by topic, reducing the time needed to review them and making report analysis more effective and efficient.
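The idea of dividing reports into topic sub-groups can be illustrated with a minimal sketch. This is not the model proposed in the paper: it swaps in plain bag-of-words counting with keyword overlap, and the topic names, keyword sets, and sample reports are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical topic vocabularies; the paper's actual topics are not given here.
TOPIC_KEYWORDS = {
    "ransomware": {"ransomware", "encrypt", "ransom", "decrypt"},
    "phishing": {"phishing", "email", "credential", "spoof"},
}


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def assign_topic(text):
    """Return the topic whose keywords occur most often, or None if none occur."""
    counts = bag_of_words(text)
    scores = {
        topic: sum(counts[word] for word in words)
        for topic, words in TOPIC_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def group_reports(reports):
    """Map each assigned topic to the list of report ids placed in it."""
    groups = {}
    for report_id, text in reports.items():
        groups.setdefault(assign_topic(text), []).append(report_id)
    return groups


# Hypothetical sample reports, already extracted as plain text.
reports = {
    "r1": "The ransomware will encrypt files and demand a ransom.",
    "r2": "A phishing email tried to steal each credential.",
}
groups = group_reports(reports)
print(groups)
```

A probabilistic topic model such as latent Dirichlet allocation would learn the topic vocabularies from the reports themselves instead of relying on fixed keyword sets.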

Fig. 1. Example of a PDF file that is problematic when extracting text

Fig. 2. Topic Modeling based on Security Intelligence Report

Fig. 3. Result of applying the topic model to a test document

Table 1. Result when a PDF document is simply extracted as text

Table 2. Example of extracting the same PDF document with the method developed in this work

Table 3. Topic by bag-of-words

Table 4. Satisfaction evaluation questions for the Security Intelligence Report Topic Automatic Extraction Model

Table 5. Satisfaction with the Security Intelligence Report Topic Automatic Extraction Model
