Evaluation of Topic Modeling Performance for Overseas Construction Market Analysis Using LDA and BERTopic on News Articles

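  • Junwoo Baek (백준우) (Department of Civil and Environmental Engineering, Seoul National University);
  • Sehwan Chung (정세환) (Department of Civil and Environmental Engineering, Seoul National University);
  • Seokho Chi (지석호) (Department of Civil and Environmental Engineering, Seoul National University)
  • Received: 2023.08.08
  • Reviewed: 2023.09.14
  • Published: 2023.12.01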

Abstract

Grasping local conditions accurately and quickly is crucial to the success of overseas construction projects. Topic modeling of news articles from the target market is one way to achieve this. This study applied two topic modeling methods, Latent Dirichlet Allocation (LDA) and BERTopic, to news articles to determine which is better suited to market condition analysis. To check whether the automatically generated topics matched the actual subjects of the documents, the authors collected 6,273 BBC news articles, created ground truth topic labels for the individual articles, and compared those labels with the topic modeling results. LDA achieved an F1 score of 0.011 and BERTopic 0.244. These results indicate that BERTopic reflected the actual topics of the news articles more accurately, making it the more effective of the two for automatically identifying key issues in the overseas construction market.
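As a rough illustration of the comparison step, the sketch below computes a macro-averaged F1 score between ground truth topic labels and the labels assigned by a topic model. The abstract does not say how the F1 score was averaged or how model topics were mapped to ground truth labels, so the macro averaging, the function name `macro_f1`, and the example labels are all assumptions made for illustration only.

```python
from collections import Counter


def macro_f1(true_labels, predicted_labels):
    """Macro-averaged F1 over the labels that occur in the ground truth.

    Assumes each article has exactly one ground truth label and one
    predicted label (hypothetical setup, not taken from the paper).
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrongly assigned
            fn[truth] += 1  # true label was missed

    scores = []
    for label in sorted(set(true_labels)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)

    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels for four articles.
truth = ["economy", "economy", "politics", "politics"]
predicted = ["economy", "politics", "politics", "politics"]
score = macro_f1(truth, predicted)
print(round(score, 3))
```

A score near 1 would mean the model's topics line up with the labeled subjects for most articles, while a score near 0, like the 0.011 reported for LDA, would mean they rarely do.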

Funding Information

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00241758). This research was also conducted with the support of the "National R&D Project for Smart Construction Technology (No. 23SMIP-A158708-04)" funded by the Korea Agency for Infrastructure Technology Advancement under the Ministry of Land, Infrastructure and Transport.
