KOREAN TOPIC MODELING USING MATRIX DECOMPOSITION

  • June-Ho Lee (Department of Mathematics, Pusan National University; Finance, Fishery, Manufacture Industrial Mathematics Center on Big Data, Pusan National University)
  • Hyun-Min Kim (Department of Mathematics, Pusan National University; Finance, Fishery, Manufacture Industrial Mathematics Center on Big Data, Pusan National University)
  • Received : 2024.01.29
  • Accepted : 2024.04.23
  • Published : 2024.05.31

Abstract

This paper explores the application of matrix factorization, specifically CUR decomposition, in the clustering of Korean language documents by topic. It addresses the unique challenges of Natural Language Processing (NLP) in dealing with the Korean language's distinctive features, such as agglutinative words and morphological ambiguity. The study compares the effectiveness of Latent Semantic Analysis (LSA) using CUR decomposition with the classical Singular Value Decomposition (SVD) method in the context of Korean text. Experiments are conducted using Korean Wikipedia documents and newspaper data, providing insight into the accuracy and efficiency of these techniques. The findings demonstrate the potential of CUR decomposition to improve the accuracy of document clustering in Korean, offering a valuable approach to text mining and information retrieval in agglutinative languages.
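To make the comparison concrete, below is a minimal sketch in Python (NumPy/scikit-learn) of the two pipelines the abstract describes: LSA via a rank-k truncated SVD of the term-document matrix versus a CUR-style factorization built from sampled columns and rows, each followed by k-means clustering of the documents. The toy corpus, the norm-based column/row sampling, and the whitespace tokenization (standing in for proper Korean morphological analysis, e.g. with a KoNLPy tagger) are illustrative assumptions, not the authors' exact experimental setup.

```python
# Sketch: topic clustering of Korean documents via truncated SVD (LSA)
# and a simple CUR decomposition. Assumptions: toy corpus, whitespace
# tokenization instead of Korean morphological analysis, and squared-norm
# column/row sampling in place of the paper's exact CUR construction.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "주식 시장 금융 은행 금리 대출",   # finance-themed
    "금융 은행 대출 시장 주식 투자",   # finance-themed
    "축구 경기 선수 감독 리그 우승",   # soccer-themed
    "선수 감독 축구 리그 경기 득점",   # soccer-themed
]

# Term-document matrix A (documents x terms), TF-IDF weighted.
A = TfidfVectorizer().fit_transform(docs).toarray()
k = 2  # number of latent topics / clusters

# --- Classical LSA: rank-k truncated SVD, A ~ U_k S_k V_k^T ---
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_svd = U[:, :k] * s[:k]          # documents in the k-dim latent space

# --- Simple CUR: sample columns/rows with prob. ~ squared norms ---
rng = np.random.default_rng(0)

def sample_indices(M, axis, count):
    """Pick `count` indices with probability proportional to squared norms."""
    p = (M ** 2).sum(axis=axis)
    return rng.choice(len(p), size=count, replace=False, p=p / p.sum())

cols = sample_indices(A, axis=0, count=k)         # column subset -> C
rows = sample_indices(A, axis=1, count=k)         # row subset    -> R
C, R = A[:, cols], A[rows, :]
U_cur = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # so that A ~ C U R
docs_cur = C @ U_cur                               # documents via C U

# Cluster documents in each latent space and compare the labelings.
for name, X in [("SVD/LSA", docs_svd), ("CUR", docs_cur)]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(name, labels)
```

In practice the column/row selection matters: leverage-score sampling as in Mahoney and Drineas [14], or the DEIM-based selection of Sorensen and Embree [19], gives stronger approximation guarantees than the plain norm sampling used here for brevity.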

Acknowledgement

This paper is based on the Master's thesis of the first author at Pusan National University [10].

References

  1. E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, and M. McDermott, Publicly available clinical BERT embeddings, arXiv preprint arXiv:1904.03323 (2019).
  2. D. Araci, FinBERT: Financial sentiment analysis with pre-trained language models, arXiv preprint arXiv:1908.10063 (2019).
  3. I. Beltagy, K. Lo, and A. Cohan, SciBERT: A pretrained language model for scientific text, arXiv preprint arXiv:1903.10676 (2019).
  4. G. G. Chowdhury, Natural language processing, Annu. Rev. Inf. Sci. Technol. 37 (2003), no. 1, 51-89.
  5. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, Indexing by latent semantic analysis, J. Am. Soc. Inf. Sci. 41 (1990), no. 6, 391-407.
  6. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
  7. S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith, Don't stop pretraining: Adapt language models to domains and tasks, arXiv preprint arXiv:2004.10964 (2020).
  8. KB Bank AI Team, KB-ALBERT-KO (2020), GitHub repository, https://github.com/KB-Bank-AI/KB-ALBERT-KO.
  9. J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2020), no. 4, 1234-1240.
  10. J.-H. Lee, Korean document clustering by topic using matrix factorizations, Master's thesis, Pusan National University (2021).
  11. E. D. Liddy, Natural language processing, in Encyclopedia of Library and Information Science, 2nd ed., Marcel Dekker, New York (2001).
  12. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
  13. N. Ljubešić, D. Boras, N. Bakarić, and J. Njavro, Comparing measures of semantic similarity, ITI 2008 - 30th Int. Conf. Inf. Technol. Interfaces (2008), 675-682.
  14. M. W. Mahoney and P. Drineas, CUR matrix decompositions for improved data analysis, Proc. Natl. Acad. Sci. 106 (2009), no. 3, 697-702.
  15. M. W. Mahoney, M. Maggioni, and P. Drineas, Tensor-CUR decompositions for tensor-based data, SIAM J. Matrix Anal. Appl. 30 (2008), no. 3, 957-987.
  16. M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, Deep contextualized word representations, arXiv preprint arXiv:1802.05365 (2018).
  17. C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683 (2019).
  18. SK T-Brain, KoBERT (2019), GitHub repository, https://github.com/SKTBrain/KoBERT.
  19. D. C. Sorensen and M. Embree, A DEIM induced CUR factorization, SIAM J. Sci. Comput. 38 (2016), no. 3, A1454-A1482.
  20. L. N. Trefethen and D. Bau III, Numerical Linear Algebra, SIAM, Philadelphia, PA, 1997.