A Named Entity Recognition Model in Criminal Investigation Domain using Pretrained Language Model

  • Kim, Hee-Dou (Department of Bigdata Convergence, Korea University)
  • Lim, Heuiseok (Department of Computer Science and Engineering, Korea University)
  • Received : 2021.11.04
  • Reviewed : 2022.02.20
  • Published : 2022.02.28


This study develops a named entity recognition (NER) model specialized for the criminal investigation domain using deep learning techniques. We propose a system that automatically extracts and categorizes modus operandi and other crime-related information from unstructured text such as criminal court judgments and investigation documents, so that it can support data-driven crime prevention analysis and investigation. To this end, we collected criminal investigation domain text and newly defined the named entity categories required from the perspective of crime analysis. The proposed model applies KoELECTRA, a pretrained language model that has recently shown high performance in natural language processing. On the crime domain NER experiment data defined in this study, it achieves a micro-average (micro avg) F1-score of 99% and a macro-average (macro avg) F1-score of 96% on the 9 main categories, and a micro avg F1-score of 98% and a macro avg F1-score of 62% on the 56 subcategories. We analyze the proposed model in terms of potential improvements and practical applications.
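The abstract reports a large gap between micro avg and macro avg F1-score on the 56 subcategories. One common reason for a gap of this kind is class imbalance: micro averaging pools the counts of all classes, so frequent classes dominate, while macro averaging gives every class equal weight, so a few poorly recognized rare classes pull the score down. The sketch below illustrates this with made-up counts. It is not the authors' evaluation code, the class names and numbers are hypothetical, and whether imbalance explains the reported gap is an assumption.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one class
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(per_class: Dict[str, Counts]) -> float:
    """Pool the counts over all classes, then compute a single F1."""
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    return f1(tp, fp, fn)


def macro_f1(per_class: Dict[str, Counts]) -> float:
    """Compute F1 per class, then take the unweighted mean."""
    return sum(f1(*c) for c in per_class.values()) / len(per_class)


# Hypothetical counts: two frequent, well-recognized classes and
# two rare, poorly recognized ones. Not data from the paper.
toy_counts: Dict[str, Counts] = {
    "FREQUENT_A": (980, 10, 10),
    "FREQUENT_B": (970, 15, 15),
    "RARE_C": (1, 2, 4),
    "RARE_D": (0, 1, 3),
}

if __name__ == "__main__":
    print(f"micro avg F1: {micro_f1(toy_counts):.3f}")
    print(f"macro avg F1: {macro_f1(toy_counts):.3f}")
```

With counts like these, the micro avg stays high because the rare classes contribute few tokens, while the macro avg drops sharply. For this reason, per-class scores are usually inspected alongside the two averages when the label set is large.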


