Product Evaluation Criteria Extraction through Online Review Analysis: Using LDA and k-Nearest Neighbor Approach

  • Lee, Ji Hyeon (Department of Philosophy, College of Humanities, Hanyang University) ;
  • Jung, Sang Hyung (School of Business, Hanyang University) ;
  • Kim, Jun Ho (Department of Mathematics, College of Natural Sciences, Hanyang University) ;
  • Min, Eun Joo (School of Finance, Hanyang University) ;
  • Yeo, Un Yeong (School of Business Informatics, Hanyang University) ;
  • Kim, Jong Woo (School of Business, Hanyang University)
  • Received: 2019.08.19
  • Reviewed: 2020.02.28
  • Published: 2020.03.31

Abstract

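Product evaluation criteria are indicators that express the attributes and values of a product, allowing users and companies to measure and understand it. For a company to evaluate and compare its own products objectively, selecting appropriate criteria is essential. These criteria should reflect the product features that consumers consider when they evaluate a product after actually purchasing and using it. However, the evaluation criteria used so far do not reflect consumer opinions, which differ from product to product. Previous studies extracted product features and topics from online reviews, which reflect consumer opinions, and used them as evaluation criteria. Even so, these approaches still extract criteria with low relevance to the product or fail to filter out inappropriate words. To overcome this limitation, this study developed and validated a model that extracts candidate evaluation criteria from reviews with Latent Dirichlet Allocation (LDA) and refines them with the k-nearest neighbor approach (k-NN). The proposed method consists of a preparation phase and an extraction phase. In the preparation phase, a word embedding model and a k-NN classifier for refining the candidate criteria are built. In the extraction phase, the candidates are refined using the k-NN classifier and the mention ratio, and the final result is derived. To evaluate the proposed model, three comparison models were selected: a noun frequency extraction model, an LDA frequency extraction model, and the evaluation criteria provided by an actual e-commerce site. For the comparison, a survey was conducted, the responses were scored, and the results were tested statistically. The proposed model proved superior in 26 of 30 tests. The proposed model can be used on e-commerce sites to derive dimensions for each product group that reflect review characteristics, and on that basis it is expected to contribute substantially to analyzing and using reviews to discover insights.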

Product evaluation criteria are indicators that describe the attributes or values of products and enable users and manufacturers to measure and understand them. When companies analyze their products or compare them with competitors' products, they must select appropriate criteria for objective evaluation. The criteria should reflect the product features that consumers consider when they purchase, use, and evaluate the products. However, current evaluation criteria do not reflect consumer opinions, which differ from product to product. Previous studies used online reviews from e-commerce sites, which reflect consumer opinions, to extract product features and topics and used them as evaluation criteria. However, these approaches still extract criteria with little relevance to the product or leave inappropriate words unrefined. To overcome this limitation, this research proposes an LDA-k-NN model, which extracts candidate criteria words from online reviews using LDA and refines them with a k-nearest neighbor approach. The proposed approach starts with a preparation phase consisting of six steps. First, review data are collected from e-commerce websites. Most e-commerce websites classify their items into high-level, middle-level, and low-level categories. Review data for the preparation phase are gathered from each middle-level category and later merged to represent a single high-level category. Next, nouns, adjectives, adverbs, and verbs are extracted from the reviews using part-of-speech information obtained from a morpheme analysis module. After preprocessing, LDA yields the words for each topic in the reviews, and only the nouns among the topic words are kept as candidate criteria words. Then, the words are tagged according to their suitability as criteria for each middle-level category. Next, every tagged word is vectorized with a pre-trained word embedding model. Finally, a k-nearest neighbor case-based approach is used to classify each word with its tag.
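The final step of the preparation phase, classifying a word vector by its nearest tagged neighbors, can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the value of k, or the tag names, so cosine similarity, k = 3, the tags "criterion" and "not_criterion", and the toy 2-D vectors below are all assumptions made for illustration; a real setup would use vectors from the pre-trained word embedding model.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_label(vector, tagged, k=3):
    """Return the majority tag among the k tagged words most similar to `vector`."""
    nearest = sorted(tagged, key=lambda item: cosine(vector, item[1]), reverse=True)[:k]
    votes = Counter(tag for _, _, tag in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical tagged cases: (word, toy 2-D embedding, tag).
TAGGED = [
    ("battery", (0.9, 0.1), "criterion"),
    ("screen", (0.8, 0.2), "criterion"),
    ("price", (0.7, 0.3), "criterion"),
    ("yesterday", (0.1, 0.9), "not_criterion"),
    ("friend", (0.2, 0.8), "not_criterion"),
    ("thing", (0.3, 0.7), "not_criterion"),
]

print(knn_label((0.85, 0.15), TAGGED))  # vector close to the "criterion" cases
print(knn_label((0.15, 0.85), TAGGED))  # vector close to the "not_criterion" cases
```

An odd k with two tags avoids tied votes; with more tags or an even k, a tie-breaking rule would have to be chosen.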
After the preparation phase is set up, the criteria extraction phase is carried out for low-level categories. This phase starts by crawling reviews in the corresponding low-level category. The same preprocessing as in the preparation phase is applied, using the morpheme analysis module and LDA. Candidate criteria words are obtained by taking the nouns from the data and are vectorized with the pre-trained word embedding model. Finally, evaluation criteria are extracted by refining the candidate words using the k-nearest neighbor approach and the reference proportion of each word in the word set. To evaluate the performance of the proposed model, an experiment was conducted with reviews from '11st', one of the biggest e-commerce companies in Korea. The review data came from the 'Electronics/Digital' section, one of the high-level categories on 11st. The proposed model was compared with three alternatives: the actual criteria of 11st, a model that extracts nouns with the morpheme analysis module and refines them by word frequency, and a model that extracts nouns from LDA topics and refines them by word frequency. In the evaluation, the proposed model and the three alternatives each predicted evaluation criteria for 10 low-level categories. The criteria words extracted by each model were combined into a single word set, which was used for the survey questionnaire. In the survey, respondents chose every item they considered an appropriate criterion for each category. A model scored whenever a chosen word was one that the model had extracted. The proposed model had higher scores than the other models in 8 of the 10 low-level categories. Paired t-tests on the scores of each model confirmed that the proposed model performed better in 26 of 30 tests. In addition, the proposed model was the best model in terms of accuracy.
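The refinement at the end of the extraction phase can be sketched as follows. The abstract does not define the reference proportion precisely, so this sketch assumes it is each candidate word's share of all candidate-word mentions in the reviews. The threshold `min_share`, the tag names, the toy reviews, and the k-NN output are hypothetical.

```python
from collections import Counter


def reference_proportions(reviews, candidates):
    """Share of all candidate-word mentions that each candidate accounts for."""
    counts = Counter(w for review in reviews for w in review if w in candidates)
    total = sum(counts.values())
    return {w: (counts[w] / total if total else 0.0) for w in candidates}


def refine(candidates, knn_tags, proportions, min_share=0.15):
    """Keep candidates tagged as criteria whose mention share reaches min_share."""
    return sorted(
        w for w in candidates
        if knn_tags.get(w) == "criterion" and proportions[w] >= min_share
    )


# Hypothetical tokenized reviews (nouns only) for one low-level category.
REVIEWS = [
    ["battery", "price", "friend"],
    ["battery", "screen"],
    ["price", "battery", "thing"],
    ["screen", "price", "weight"],
]
CANDIDATES = {"battery", "price", "screen", "weight", "thing"}
# Hypothetical k-NN output for each candidate word.
KNN_TAGS = {
    "battery": "criterion",
    "price": "criterion",
    "screen": "criterion",
    "weight": "criterion",
    "thing": "not_criterion",
}

shares = reference_proportions(REVIEWS, CANDIDATES)
print(refine(CANDIDATES, KNN_TAGS, shares))
```

In this toy data, "thing" is dropped by its k-NN tag and "weight" by its low mention share, which shows why the two filters are complementary: the tag removes words that are not criteria in general, and the proportion removes criteria that reviewers of this category rarely mention.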
This research proposes an evaluation criteria extraction method that combines topic extraction using LDA with refinement by a k-nearest neighbor approach. This method overcomes the limits of previous dictionary-based models and frequency-based refinement models. This study can contribute to improving review analysis for deriving business insights in the e-commerce market.
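The survey scoring and paired t-test used in the evaluation can be illustrated with a short sketch. The abstract does not state how scores were paired or aggregated, so pairing by respondent is an assumption, and the function names and the score lists are hypothetical.

```python
import math
from statistics import mean, stdev


def model_score(chosen_words, extracted_words):
    """Number of words a respondent chose that the model had extracted."""
    return len(set(chosen_words) & set(extracted_words))


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two equal-length score lists.

    Raises ZeroDivisionError if all paired differences are identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# One respondent's choices scored against one model's extracted words.
print(model_score(["battery", "price", "color"], ["battery", "screen", "price"]))

# Hypothetical per-respondent scores for two models in one category.
proposed = [5, 6, 7, 6, 8]
baseline = [4, 4, 6, 5, 5]
t_stat, dof = paired_t(proposed, baseline)
print(round(t_stat, 3), dof)
```

The statistic is then compared against a t distribution with `dof` degrees of freedom. With 4 degrees of freedom, the two-sided 5% critical value is about 2.776.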
