http://dx.doi.org/10.9708/jksci.2021.26.04.105

A Study on the Classification of Unstructured Data through Morpheme Analysis  

Kim, SungJin (Dept. of Multimedia Engineering, GangNeung-Wonju National University)
Choi, NakJin (Dept. of Multimedia Engineering, GangNeung-Wonju National University)
Lee, JunDong (Dept. of Multimedia Engineering, GangNeung-Wonju National University)
Abstract
In the era of big data, interest in data is exploding. In particular, the development of the Internet and social media has generated new kinds of data, ushering in the era of big data and artificial intelligence and opening a new chapter in convergence technology. There is also growing demand for analyzing data that programs could not handle in the past. In this paper, an analysis model was designed and verified for the classification of unstructured data, a task often required in the era of big data. For the data, thesis summaries, main keywords, and sub-keywords were crawled from DBPia, a database was built using KoNLP's data dictionary, and words were tokenized through morpheme analysis. Nouns were then extracted using KAIST's 9 part-of-speech classification system, TF-IDF values were computed, and an analysis dataset was created by combining the training data with the Y values. Finally, the adequacy of classification was measured by applying three analysis algorithms (random forest, SVM, and decision tree) to the generated analysis dataset. The classification technique proposed in this paper can be applied in various fields beyond thesis classification, such as civil complaint classification and other text-related analysis.
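The TF-IDF step and the construction of the analysis dataset can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which TF-IDF variant was used, so the sketch assumes the plain form tf(t, d) × log(N / df(t)), and the token lists, labels, and the function name `tf_idf` are hypothetical stand-ins for nouns already extracted by a morpheme analyzer.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return (vocab, matrix) for a list of token lists.

    Assumes the plain variant: tf = count / document length,
    idf = log(N / document frequency). Other variants exist.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        row = [
            (counts[t] / total) * math.log(n_docs / df[t]) if total else 0.0
            for t in vocab
        ]
        matrix.append(row)
    return vocab, matrix

# Hypothetical toy input: nouns per document and a class label (Y value) each.
docs = [
    ["data", "analysis", "model"],
    ["network", "security", "model"],
    ["data", "mining", "text"],
]
labels = ["bigdata", "security", "bigdata"]

vocab, matrix = tf_idf(docs)

# Analysis dataset: one row per document, TF-IDF features followed by the Y value.
dataset = [row + [y] for row, y in zip(matrix, labels)]
```

Each row of `dataset` holds one document's TF-IDF features followed by its label, which is the shape a random forest, SVM, or decision tree classifier would typically be trained on.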
Keywords
Big Data; Data Analysis; Visualization; Textmining; Modeling;