http://dx.doi.org/10.9717/kmms.2020.23.4.595

Generative Probabilistic Model with Dirichlet Prior Distribution for Similarity Analysis of Research Topic

Milyahilu, John (Dept. of IT Convergence & Application Eng., Pukyong National University)
Kim, Jong Nam (Dept. of IT Convergence & Application Eng., Pukyong National University)
Abstract
We propose a generative probabilistic model with a Dirichlet prior distribution for topic modeling and text similarity analysis. The model assigns a topic to each document and calculates the text correlation between documents within a corpus. It also provides the posterior probability of each topic for a document, based on the prior distribution in the corpus. We then present a Gibbs sampling algorithm for inference on the posterior distribution and compute the text correlation among 50 abstracts from papers published by IEEE. We also conduct supervised learning to set a benchmark for judging the performance of LDA (Latent Dirichlet Allocation). The experiments show that the accuracy of topic assignment to a document is 76% for LDA. Supervised learning yields an accuracy of 61%, a precision of 93%, and an F1-score of 96%. The discussion of the experimental results justifies the topic assignments in terms of probabilities, distributions, evaluation metrics, and correlation coefficients.
Keywords
Gibbs Sampling; Corpus; Machine Learning; Probabilistic Model; Topics
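
The pipeline summarized in the abstract (Gibbs sampling for LDA, topic assignment from the posterior, then correlation between documents) can be illustrated with a minimal sketch. This is not the authors' implementation: the collapsed Gibbs sampler, the hyperparameters `alpha` and `beta`, the toy corpus of word ids, and the choice of Pearson correlation over document-topic vectors are all assumptions made for illustration, and the function name `lda_gibbs` is hypothetical.

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs is a list of documents, each a list of integer word ids.
    alpha and beta are symmetric Dirichlet hyperparameters (assumed values).
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # tokens per (document, topic)
    n_kw = np.zeros((n_topics, vocab_size))  # tokens per (topic, word)
    n_k = np.zeros(n_topics)                 # tokens per topic

    # Random initial topic for every token, with the counts to match.
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Conditional distribution over topics for this token.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Record the newly sampled assignment.
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1

    # Posterior mean estimates of the document-topic and topic-word distributions.
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + n_topics * alpha)
    phi = (n_kw + beta) / (n_k[:, None] + vocab_size * beta)
    return theta, phi

# Toy corpus (assumed): four short documents over a six-word vocabulary.
docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1, 0], [3, 4, 5, 3, 4], [5, 4, 3, 3, 5]]
theta, phi = lda_gibbs(docs, n_topics=3, vocab_size=6)

assigned_topic = theta.argmax(axis=1)  # most probable topic per document
corr = np.corrcoef(theta)              # Pearson correlation between documents
print(assigned_topic)
print(np.round(corr, 2))
```

Each row of `theta` is a document's posterior topic distribution, so `argmax` gives one possible topic assignment and `corr[i, j]` gives one possible similarity score between documents `i` and `j`. Other similarity measures, such as cosine similarity or Jensen-Shannon divergence, would fit in the same place.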