http://dx.doi.org/10.9708/jksci.2020.25.10.087

Efficient Keyword Extraction from Social Big Data Based on Cohesion Scoring  

Kim, Hyeon Gyu (Div. of Computer Science and Engineering, Sahmyook University)
Abstract
Social reviews such as SNS feeds and blog articles are widely used to extract keywords that reflect users' opinions and complaints, and they often include proper nouns or new words reflecting recent trends. These words are generally not in a dictionary, so conventional morphological analyzers may fail to detect and extract them properly. In addition, the long processing time of such analyzers makes them inadequate for providing analysis results in a timely manner. This paper presents a method for efficient keyword extraction from social reviews based on the notion of cohesion scoring. Cohesion scores are calculated from word frequencies, so keyword extraction based on them requires no dictionary. However, their accuracy can degrade when the input data is poorly spaced. To address this, an algorithm is presented that improves the existing cohesion scoring mechanism using a word tree structure. Our experimental results show that the proposed method took only 0.008 seconds to extract keywords from 1,000 reviews, with an error ratio of 15.5%, which is better than that of existing morphological analyzers.
Keywords
Big data; Social reviews; Keyword extraction; Cohesion score; Morphological analysis
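
The abstract's remark that cohesion scores come from word frequencies alone can be illustrated with a minimal sketch. It assumes one common formulation, the forward cohesion of a prefix w of length n, (count(w) / count(first character of w))^(1/(n-1)); the paper's exact definition and its word-tree improvement for poorly spaced input are not reproduced here. The function names and the romanized toy tokens are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_prefixes(tokens):
    """Count how often each leading substring (prefix) occurs across tokens."""
    counts = Counter()
    for token in tokens:
        for end in range(1, len(token) + 1):
            counts[token[:end]] += 1
    return counts


def cohesion(word, counts):
    """Assumed forward cohesion: (count(word) / count(word[0])) ** (1 / (n - 1))."""
    n = len(word)
    if n < 2 or counts[word] == 0:
        return 0.0  # undefined for single characters or unseen prefixes
    return (counts[word] / counts[word[0]]) ** (1.0 / (n - 1))


def best_prefix(token, counts):
    """Return the prefix of token (length >= 2) with the highest cohesion."""
    candidates = [token[:end] for end in range(2, len(token) + 1)]
    return max(candidates, key=lambda w: cohesion(w, counts), default=token)


# Toy input: a noun followed by different romanized particles, plus one
# unrelated token sharing the first letter. No dictionary is consulted.
tokens = ["iphoneeun", "iphoneul", "iphoneui", "iphonega", "idea"]
counts = count_prefixes(tokens)
print(best_prefix("iphoneul", counts))  # prints "iphone"
```

In this toy input the shared stem scores highest because its frequency stays high as the prefix grows, while the count drops as soon as a particle is appended. Real review data would need far more tokens for the frequencies to be meaningful.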