http://dx.doi.org/10.3743/KOSIM.2022.39.3.099

Topic Model Augmentation and Extension Method using LDA and BERTopic  

Kim, SeonWook (Department of Library and Information Science, College of Social Sciences, Kyungpook National University)
Yang, Kiduk (Yeongnam Ancient Documents Archive Center)
Publication Information
Journal of the Korean Society for Information Management, v.39, no.3, 2022, pp. 99-132
Abstract
The purpose of this study is to propose AET (Augmented and Extended Topics), a novel method that synthesizes LDA and BERTopic results, and to demonstrate it by analyzing recently published LIS articles. To that end, 55,442 abstracts from 85 LIS journals in the WoS database, spanning January 2001 to October 2021, were analyzed. AET first constructs a WORD2VEC-based cosine similarity matrix between the LDA and BERTopic results, then extracts AT (Augmented Topics) by repeating matrix reordering and segmentation for as long as the semantic relations between topics remain valid, and finally determines ET (Extended Topics) by removing any LDA-related residual subtopics from the matrix and ordering the remainder by F1, the harmonic mean of the BERTopic topic size rank and the inverse cosine similarity rank. Compared with the baseline LDA result, AT effectively concretized the original LDA topic model, and ET discovered meaningful new topics that LDA did not. In the qualitative performance evaluation, AT outperformed LDA, while ET performed comparably except in a few cases.
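
Below is a minimal Python sketch of the two computational steps the abstract names: building the WORD2VEC-based cosine similarity matrix between LDA and BERTopic topics, and scoring Extended Topic candidates by F1. All names (topic_vector, similarity_matrix, et_score) and the toy data are illustrative assumptions, not the authors' implementation; topics are assumed to be given as lists of top words with pre-trained word embeddings available.

import numpy as np

def topic_vector(words, vectors):
    # Represent a topic as the mean word2vec embedding of its top words
    # (an assumed topic representation); out-of-vocabulary words are skipped.
    embs = [vectors[w] for w in words if w in vectors]
    return np.mean(embs, axis=0)

def similarity_matrix(lda_topics, bert_topics, vectors):
    # Cosine similarity between every LDA topic and every BERTopic topic:
    # the matrix that AET reorders and segments to find Augmented Topics.
    L = np.stack([topic_vector(t, vectors) for t in lda_topics])
    B = np.stack([topic_vector(t, vectors) for t in bert_topics])
    L /= np.linalg.norm(L, axis=1, keepdims=True)
    B /= np.linalg.norm(B, axis=1, keepdims=True)
    return L @ B.T  # shape: (n_lda_topics, n_bertopic_topics)

def et_score(size_rank, inv_sim_rank):
    # F1 (harmonic mean) of a BERTopic topic's size rank and its inverse
    # cosine similarity rank, used to order Extended Topic candidates.
    return 2 * size_rank * inv_sim_rank / (size_rank + inv_sim_rank)

# Toy usage with random 50-dimensional stand-in embeddings.
rng = np.random.default_rng(0)
vocab = "library user data model web search topic".split()
vectors = {w: rng.normal(size=50) for w in vocab}
lda = [["library", "user"], ["data", "model"]]
bert = [["web", "search"], ["topic", "model"], ["data", "user"]]
print(similarity_matrix(lda, bert, vectors).round(2))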
Keywords
LDA; BERT; BERTopic; WORD2VEC; AET; library and information science; research trends; topic modeling; matrix reordering; synthesis