A Study on the Optimal Search Keyword Extraction and Retrieval Technique Generation Using Word Embedding

A Study on Optimal Keyword Extraction and Search Methods Using Word Embedding

  • Jeong-In Lee (Korean Peninsula Infrastructure Special Committee, Korea Institute of Civil Engineering and Building Technology)
  • Jin-Hee Ahn (Korean Peninsula Infrastructure Special Committee, Korea Institute of Civil Engineering and Building Technology)
  • Kyung-Taek Koh (Korean Peninsula Infrastructure Special Committee, Korea Institute of Civil Engineering and Building Technology)
  • YoungSeok Kim (Northern Infrastructure Specialized Team, Korea Institute of Civil Engineering and Building Technology)
  • Received : 2023.06.05
  • Accepted : 2023.06.19
  • Published : 2023.06.30

Abstract

In this paper, we propose a technique for extracting optimal search keywords and building retrieval queries for news article classification. The technique was validated on a case study: identifying trends related to North Korean construction. BigKinds, a representative Korean news platform, was used to select sample articles and extract keywords. The extracted keywords were vectorized using word embedding, and the similarity between keywords was then measured with cosine similarity. For each of the 10 most frequent keywords, keywords with a similarity of 0.5 or higher were grouped into a cluster. Following the BigKinds search form, keywords within a cluster were joined with 'OR' and clusters were joined with 'AND'. In-depth analysis confirmed that the resulting query retrieved meaningful articles appropriate to the original purpose. The significance of this work is that news articles can be classified for a user's specific purpose without modifying the existing classification system or search form.

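This paper proposes a method for extracting optimal keywords and searching with them for data research, and validates the method using the example of identifying trends related to North Korean construction. Sample articles were selected and keywords were extracted using BigKinds, a representative Korean news platform. The extracted keywords were vectorized using word embedding, and the similarity between them was then examined with cosine similarity. Keywords with a similarity of 0.5 or higher were clustered around the 10 most frequent keywords. In line with the BigKinds search form, keywords within a cluster were connected with 'OR' and clusters were connected with 'AND'. In-depth analysis confirmed that meaningful articles matching the original purpose were extracted. The work is significant in that it enables data research and classification that satisfy a user's specific purpose without altering the existing classification system or search form.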
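
The pipeline described in the abstract (cosine similarity between keyword vectors, clustering around the most frequent keywords at a 0.5 threshold, then 'OR' within clusters and 'AND' between clusters) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the two-dimensional toy vectors, the keyword counts, the parentheses in the query string, and the rule that a keyword joins only the first cluster that claims it are all assumptions made for illustration; the abstract does not state how overlapping clusters were handled.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_clusters(keyword_counts, vectors, top_n=10, threshold=0.5):
    """Group keywords around the top_n most frequent keywords (the seeds).

    Assumption: a keyword already placed in an earlier cluster is not reused.
    """
    seeds = [w for w, _ in Counter(keyword_counts).most_common(top_n)
             if w in vectors]
    assigned = set()
    clusters = []
    for seed in seeds:
        if seed in assigned:
            continue
        members = [seed]
        assigned.add(seed)
        for word, vec in vectors.items():
            if word in assigned:
                continue
            if cosine_similarity(vectors[seed], vec) >= threshold:
                members.append(word)
                assigned.add(word)
        clusters.append(members)
    return clusters


def build_query(clusters):
    """Join keywords with OR inside a cluster and clusters with AND."""
    groups = ["(" + " OR ".join(cluster) + ")" for cluster in clusters]
    return " AND ".join(groups)


# Hypothetical toy data: real embeddings would have hundreds of dimensions.
vectors = {
    "construction": [1.0, 0.0],
    "building": [0.9, 0.1],
    "housing": [0.8, 0.6],
    "Pyongyang": [0.0, 1.0],
    "Wonsan": [0.1, 0.9],
}
counts = {"construction": 12, "Pyongyang": 9, "building": 5,
          "Wonsan": 4, "housing": 2}

clusters = build_clusters(counts, vectors, top_n=2, threshold=0.5)
query = build_query(clusters)
print(query)
# -> (construction OR building OR housing) AND (Pyongyang OR Wonsan)
```

With real data, the vectors would come from a pre-trained word embedding model and `top_n` would be 10, as in the abstract. ANDing many clusters narrows the result set quickly, so the number of clusters used in the final query is a tuning choice.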

Keywords

Acknowledgement

Research for this paper was carried out under the KICT Research Program (20230068-001, Research on the establishment of integrated and linked infrastructure for the co-prosperity of South and North Korea) funded by the Ministry of Science and ICT.
