http://dx.doi.org/10.17703/JCCT.2022.8.6.891

A Study on the Use of Stopword Corpus for Cleansing Unstructured Text Data  

Lee, Won-Jo (Dept. of Industrial Management Eng., Ulsan College)
Publication Information
The Journal of the Convergence on Culture Technology / v.8, no.6, 2022, pp. 891-897
Abstract
In big data analysis, raw text data mostly exists in unstructured forms, and it can be analyzed as structured data only after heuristic pre-processing and computer-based post-processing cleansing. In this study, the collected raw data is pre-processed to remove unnecessary elements so that the word cloud technique of R, one of the text data analysis techniques, can be applied, and stopwords are then removed in post-processing. A word cloud case study was conducted, in which the occurrence frequency of each word is calculated and high-frequency words are presented as key issues. To address the problems of the "nested stopword source code" method, the existing approach to stopword processing with the word cloud technique of R, we propose using a "general stopword corpus" and a "user-defined stopword corpus" and conduct a case analysis. The advantages and disadvantages of the proposed "unstructured data cleansing process model" are compared and presented, along with a practical application of word cloud visualization analysis using the proposed "external corpus cleansing technique".
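The cleansing flow described in the abstract (pre-process the raw text, remove stopwords using a general corpus plus a user-defined corpus, then count word frequencies) can be sketched roughly as follows. The study itself works in R; this sketch uses Python purely for illustration and is not the author's code. The stopword sets, the one-word-per-line corpus file format, and all function names are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the two external corpora named in the abstract.
GENERAL_STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to", "for"}
USER_STOPWORDS = {"data", "study"}


def load_corpus(path):
    """Read a stopword corpus from a text file (assumed: one word per line)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def tokenize(text):
    """Pre-processing: lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens, *corpora):
    """Post-processing: drop every token found in any of the given corpora."""
    stopwords = set().union(*corpora)
    return [token for token in tokens if token not in stopwords]


def word_frequencies(text, *corpora):
    """Count the words that survive cleansing; these counts feed a word cloud."""
    return Counter(remove_stopwords(tokenize(text), *corpora))


sample = "The study of text data: cleansing the text and counting the words in the text."
freq = word_frequencies(sample, GENERAL_STOPWORDS, USER_STOPWORDS)
print(freq.most_common(3))
```

Keeping the stopwords in external sets (or files read through `load_corpus`) rather than hard-coding them inside nested calls means the lists can be edited without touching the analysis code, which is the motivation the abstract gives for the corpus-based approach.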
Keywords
Big Data; Text Data; Data Cleansing; Wordcloud; Visualization Analysis; Stopwords; Corpus
Citations & Related Records
Times Cited By KSCI: 5