Construction of Onion Sentiment Dictionary using Cluster Analysis

  • Oh, Seungwon (Department of Statistics, Chonnam National University) ;
  • Kim, Min Soo (Department of Statistics, Chonnam National University)
  • Received : 2018.11.09
  • Accepted : 2018.12.18
  • Published : 2018.12.31

Abstract

Much research has been devoted to developing production forecasting models to resolve the supply-demand imbalance of onions, a vegetable closely tied to the Korean diet. However, given the onion harvest season and the fact that onions can be stored, production forecasts alone are unlikely to resolve this imbalance. This paper therefore aims to build a sentiment dictionary for onion price prediction from internet news articles, which are easy to access in daily life and contain information on onion production and on the various factors that affect price. We used onion articles published from 2012 to 2016, labeled the documents according to wholesale market prices, and compared four TF-IDF variants to select a suitable one. To separate words that are positive or negative with respect to price, we applied three partitional clustering methods: k-means, DBSCAN (density-based spatial clustering of applications with noise), and GMM (Gaussian mixture model) clustering. Of these, GMM clustering produced three meaningful dictionaries: positive, negative, and neutral. To check the validity of the constructed dictionary, we applied logistic regression to classify price-rise and price-drop articles, which achieved an accuracy of 85.7%.
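The dictionary-building steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The toy corpus, the `rise`/`fall` labels, the function names, the particular smoothed TF-IDF formula, and the polarity score (TF-IDF in price-rise articles minus TF-IDF in price-drop articles) are all assumptions made for illustration. A plain one-dimensional k-means stands in for the GMM clustering that the paper found most useful, and "positive" is assumed here to mean "associated with a price rise".

```python
import math
from collections import Counter

# Hypothetical toy corpus: tokenized articles labeled by wholesale price direction.
articles = [
    (["onion", "price", "drought", "shortage"], "rise"),
    (["onion", "price", "drought", "shortage"], "rise"),
    (["onion", "price", "drought", "rain"], "rise"),
    (["onion", "price", "surplus", "harvest"], "fall"),
    (["onion", "price", "surplus", "harvest"], "fall"),
    (["onion", "price", "surplus", "rain"], "fall"),
]

def class_tfidf(articles):
    """Treat each price class as one document and compute a smoothed TF-IDF per word."""
    counts = {}
    for tokens, label in articles:
        counts.setdefault(label, Counter()).update(tokens)
    n_docs = len(counts)
    vocab = sorted(set().union(*counts.values()))
    scores = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        scores[label] = {}
        for word in vocab:
            tf = counter[word] / total
            df = sum(1 for c in counts.values() if c[word] > 0)
            idf = math.log((1 + n_docs) / (1 + df)) + 1  # assumed smoothing variant
            scores[label][word] = tf * idf
    return vocab, scores

def kmeans_1d(values, centers, n_iter=20):
    """Plain 1-D k-means (Lloyd's algorithm) from the given starting centers."""
    centers = list(centers)
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a cluster happens to be empty.
        new_centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def build_dictionary(articles):
    """Group words into negative / neutral / positive by clustering a polarity score."""
    vocab, scores = class_tfidf(articles)
    polarity = {w: scores["rise"][w] - scores["fall"][w] for w in vocab}
    lo, hi = min(polarity.values()), max(polarity.values())
    centers = kmeans_1d(list(polarity.values()), [lo, 0.0, hi])
    names = ["negative", "neutral", "positive"]
    dictionary = {name: [] for name in names}
    for word, score in polarity.items():
        nearest = min(range(len(centers)), key=lambda i: abs(score - centers[i]))
        dictionary[names[nearest]].append(word)
    return dictionary

sentiment = build_dictionary(articles)
print(sentiment)
```

On this toy corpus, words that occur only in price-rise articles fall into the positive group, words that occur only in price-drop articles fall into the negative group, and words shared equally by both classes fall into the neutral group. A real study would work with morphologically analyzed Korean news text and would validate the resulting dictionary on held-out articles, for example with logistic regression as the abstract describes.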

Acknowledgement

Supported by: National Research Foundation of Korea
