KOREAN TOPIC MODELING USING MATRIX DECOMPOSITION

  • June-Ho Lee (Department of Mathematics, Pusan National University; Finance, Fishery, Manufacture Industrial Mathematics Center on Big Data, Pusan National University)
  • Hyun-Min Kim (Department of Mathematics, Pusan National University; Finance, Fishery, Manufacture Industrial Mathematics Center on Big Data, Pusan National University)
  • Received : 2024.01.29
  • Accepted : 2024.04.23
  • Published : 2024.05.31

Abstract

This paper explores the application of matrix factorization, specifically CUR decomposition, in the clustering of Korean language documents by topic. It addresses the unique challenges of Natural Language Processing (NLP) in dealing with the Korean language's distinctive features, such as agglutinative words and morphological ambiguity. The study compares the effectiveness of Latent Semantic Analysis (LSA) using CUR decomposition with the classical Singular Value Decomposition (SVD) method in the context of Korean text. Experiments are conducted using Korean Wikipedia documents and newspaper data, providing insight into the accuracy and efficiency of these techniques. The findings demonstrate the potential of CUR decomposition to improve the accuracy of document clustering in Korean, offering a valuable approach to text mining and information retrieval in agglutinative languages.
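To make the comparison concrete, below is a minimal sketch in Python (NumPy/scikit-learn) of the two pipelines the abstract describes: LSA via a rank-k truncated SVD of the term-document matrix versus a CUR-style factorization built from sampled columns and rows, each followed by k-means clustering of the documents. The toy corpus, the norm-based column/row sampling, and the whitespace tokenization (standing in for proper Korean morphological analysis, e.g. with a KoNLPy tagger) are illustrative assumptions, not the authors' exact experimental setup.

```python
# Sketch: topic clustering of Korean documents via truncated SVD (LSA)
# and a simple CUR decomposition. Assumptions: toy corpus, whitespace
# tokenization instead of Korean morphological analysis, and squared-norm
# column/row sampling in place of the paper's exact CUR construction.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "주식 시장 금융 은행 금리 대출",   # finance-themed
    "금융 은행 대출 시장 주식 투자",   # finance-themed
    "축구 경기 선수 감독 리그 우승",   # soccer-themed
    "선수 감독 축구 리그 경기 득점",   # soccer-themed
]

# Term-document matrix A (documents x terms), TF-IDF weighted.
A = TfidfVectorizer().fit_transform(docs).toarray()
k = 2  # number of latent topics / clusters

# --- Classical LSA: rank-k truncated SVD, A ~ U_k S_k V_k^T ---
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_svd = U[:, :k] * s[:k]          # documents in the k-dim latent space

# --- Simple CUR: sample columns/rows with prob. ~ squared norms ---
rng = np.random.default_rng(0)

def sample_indices(M, axis, count):
    """Pick `count` indices with probability proportional to squared norms."""
    p = (M ** 2).sum(axis=axis)
    return rng.choice(len(p), size=count, replace=False, p=p / p.sum())

cols = sample_indices(A, axis=0, count=k)         # column subset -> C
rows = sample_indices(A, axis=1, count=k)         # row subset    -> R
C, R = A[:, cols], A[rows, :]
U_cur = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # so that A ~ C U R
docs_cur = C @ U_cur                               # documents via C U

# Cluster documents in each latent space and compare the labelings.
for name, X in [("SVD/LSA", docs_svd), ("CUR", docs_cur)]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(name, labels)
```

In practice the column/row selection matters: leverage-score sampling as in Mahoney and Drineas [14], or the DEIM-based selection of Sorensen and Embree [19], gives stronger approximation guarantees than the plain norm sampling used here for brevity.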

Acknowledgement

This paper is based on the Master's thesis of the first author at Pusan National University [10].

References

  1. E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, and M. McDermott, Publicly available clinical BERT embeddings, arXiv preprint arXiv:1904.03323 (2019).
  2. D. Araci, FinBERT: Financial sentiment analysis with pre-trained language models, arXiv preprint arXiv:1908.10063 (2019).
  3. I. Beltagy, K. Lo, and A. Cohan, SciBERT: A pretrained language model for scientific text, arXiv preprint arXiv:1903.10676 (2019).
  4. G. G. Chowdhury, Natural language processing, Annu. Rev. Inf. Sci. Technol. 37 (2003), no. 1, 51-89.
  5. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, Indexing by latent semantic analysis, J. Am. Soc. Inf. Sci. 41 (1990), no. 6, 391-407.
  6. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
  7. S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith, Don't stop pretraining: Adapt language models to domains and tasks, arXiv preprint arXiv:2004.10964 (2020).
  8. KB Bank AI Team, KB-ALBERT-KO (2020), GitHub repository, https://github.com/KB-Bank-AI/KB-ALBERT-KO.
  9. J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2020), no. 4, 1234-1240.
  10. J.-H. Lee, Korean document clustering by topic using matrix factorizations, Master's thesis, Pusan National University (2021).
  11. E. D. Liddy, Natural language processing, in Encyclopedia of Library and Information Science, 2nd ed., Marcel Dekker, New York (2001).
  12. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
  13. N. Ljubešić, D. Boras, N. Bakarić, and J. Njavro, Comparing measures of semantic similarity, ITI 2008 - 30th Int. Conf. Inf. Technol. Interfaces (2008), 675-682.
  14. M. W. Mahoney and P. Drineas, CUR matrix decompositions for improved data analysis, Proc. Natl. Acad. Sci. 106 (2009), no. 3, 697-702.
  15. M. W. Mahoney, M. Maggioni, and P. Drineas, Tensor-CUR decompositions for tensor-based data, SIAM J. Matrix Anal. Appl. 30 (2008), no. 3, 957-987.
  16. M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, Deep contextualized word representations, arXiv preprint arXiv:1802.05365 (2018).
  17. C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683 (2019).
  18. SK T-Brain, KoBERT (2019), GitHub repository, https://github.com/SKTBrain/KoBERT.
  19. D. C. Sorensen and M. Embree, A DEIM induced CUR factorization, SIAM J. Sci. Comput. 38 (2016), no. 3, A1454-A1482.
  20. L. N. Trefethen and D. Bau III, Numerical Linear Algebra, SIAM, Philadelphia, PA, 1997.