Funding Information
This paper is based on the Master's thesis of the first author at Pusan National University [10].
References
- E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, and M. McDermott, Publicly available clinical BERT embeddings, arXiv preprint arXiv:1904.03323 (2019).
- D. Araci, FinBERT: Financial sentiment analysis with pre-trained language models, arXiv preprint arXiv:1908.10063 (2019).
- I. Beltagy, K. Lo, and A. Cohan, SciBERT: A pretrained language model for scientific text, arXiv preprint arXiv:1903.10676 (2019).
- G. G. Chowdhury, Natural language processing, Annu. Rev. Inf. Sci. Technol. 37 (2003), no. 1, 51-89. https://doi.org/10.1002/aris.1440370103
- S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, Indexing by latent semantic analysis, J. Am. Soc. Inf. Sci. 41 (1990), no. 6, 391-407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
- S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith, Don't stop pretraining: Adapt language models to domains and tasks, arXiv preprint arXiv:2004.10964 (2020).
- KB Bank AI Team, KB-ALBERT-KO (2020), GitHub repository, https://github.com/KB-Bank-AI/KB-ALBERT-KO.
- J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2020), no. 4, 1234-1240. https://doi.org/10.1093/bioinformatics/btz682
- J. H. Lee, Korean document clustering by topic using matrix factorizations, Master's thesis, Pusan National University (2021).
- E. D. Liddy, Natural language processing (2001), 4-6.
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
- N. Ljubešić, D. Boras, N. Bakarić, and J. Njavro, Comparing measures of semantic similarity, ITI 2008-30th Int. Conf. Inf. Technol. Interfaces (2008), 675-682.
- M. W. Mahoney and P. Drineas, CUR matrix decompositions for improved data analysis, Proc. Natl. Acad. Sci. 106 (2009), no. 3, 697-702. https://doi.org/10.1073/pnas.0803205106
- M. W. Mahoney, M. Maggioni, and P. Drineas, Tensor-CUR decompositions for tensor-based data, SIAM J. Matrix Anal. Appl. 30 (2008), no. 3, 957-987. https://doi.org/10.1137/060665336
- M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, Deep contextualized word representations, arXiv preprint arXiv:1802.05365 (2018).
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683 (2019).
- SK T-Brain, KoBERT (2019), GitHub repository, https://github.com/SKTBrain/KoBERT.
- D. C. Sorensen and M. Embree, A DEIM induced CUR factorization, SIAM J. Sci. Comput. 38 (2016), no. 3, A1454-A1482. https://doi.org/10.1137/140978430
- L. N. Trefethen and D. Bau III, Numerical linear algebra, SIAM 50 (1997), 25-36.