Linguistic Features Discrimination for Social Issue Risk Classification

Oh, Hyo-Jung; Yun, Bo-Hyun; Kim, Chan-Young

doi:10.3745/KTSDE.2016.5.11.541

KIPS Transactions on Software and Data Engineering (정보처리학회논문지:소프트웨어 및 데이터공학)

Volume 5 Issue 11 / Pages 541-548 / 2016 / 2287-5905 (pISSN) / 2734-0503 (eISSN)

Korea Information Processing Society (한국정보처리학회)

사회적 이슈 리스크 유형 분류를 위한 어휘 자질 선별

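Oh, Hyo-Jung (Department of Archives and Records Management, Graduate School, Chonbuk National University);
Yun, Bo-Hyun (Department of Computer Education, Mokwon University);
Kim, Chan-Young (Medical School, Chonbuk National University)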

Received : 2016.10.04
Accepted : 2016.10.12
Published : 2016.11.30

https://doi.org/10.3745/KTSDE.2016.5.11.541


Abstract

Social media has become an essential source of information for listening to and monitoring users' diverse opinions. We define social 'risks' as issues that negatively influence public opinion in social media. This paper aims to discriminate among various linguistic features and reveal their effects, in order to build an automatic classification model of social risks. In particular, we adopt a word embedding technique to represent linguistic clues in risk sentences. As a preliminary experiment to analyze the characteristics of individual features, we corrected the errors in the automatic linguistic analysis. In this setting, the most important feature is NE (Named Entity) information, and the best configuration combines basic linguistic features, word embeddings, and word clusters within core predicates. Experiments under realistic social big data conditions, which include linguistic analysis errors, yield precisions of 92.08% and 85.84% for the frequent risk category set and the full test set, respectively.

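Social media has already become an essential source of information for gathering and monitoring users' diverse opinions. Among the various issues that appear on social media, this paper defines negative events that adversely affect the formation of public opinion as issue 'risks' and aims to develop a model that automatically classifies their detailed types. To this end, we select various linguistic features found in social media and identify their effects. In particular, we use the results of word embedding training as features for representing the lexico-syntactic characteristics of risk sentences. In experiments run with linguistic analysis errors corrected, so that the characteristics of individual features could be analyzed, the named entity feature proved the most effective, and the best performance was obtained when the word embedding results and the word cluster results for the core predicates were both combined with the basic linguistic features. In experiments that retained the errors of automatic linguistic analysis, to approximate application to real social big data, we obtained 92.08% on the high-frequency test set and 85.84% on the full test set of 58 categories.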
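
The feature combination described in the abstract (basic linguistic features, named entity information, word embeddings of core predicates, and word clusters) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration only: the vocabulary, the named entity types, the two-dimensional "embeddings", the cluster assignments, and the function names are hypothetical and do not come from the paper, which does not specify its feature encoding here.

```python
from typing import Dict, List, Sequence

# Toy, hypothetical resources: none of these values come from the paper.
VOCAB = ["recall", "fire", "battery"]                # basic lexical features
NE_TYPES = ["ORGANIZATION", "PRODUCT", "LOCATION"]   # named entity types
EMBEDDINGS: Dict[str, List[float]] = {               # stand-in word vectors
    "explode": [0.9, 0.1],
    "recall": [0.2, 0.8],
}
CLUSTERS: Dict[str, int] = {"explode": 0, "recall": 1}  # word -> cluster id
NUM_CLUSTERS = 2
EMBED_DIM = 2


def one_hot_bag(items: Sequence[str], inventory: Sequence[str]) -> List[float]:
    """Binary indicator for each inventory entry that occurs in items."""
    present = set(items)
    return [1.0 if entry in present else 0.0 for entry in inventory]


def mean_embedding(words: Sequence[str]) -> List[float]:
    """Average the vectors of the words that have an embedding."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        return [0.0] * EMBED_DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_bag(words: Sequence[str]) -> List[float]:
    """Binary indicator for each word cluster hit by the given words."""
    bag = [0.0] * NUM_CLUSTERS
    for word in words:
        if word in CLUSTERS:
            bag[CLUSTERS[word]] = 1.0
    return bag


def build_features(tokens: Sequence[str],
                   ne_types: Sequence[str],
                   core_predicates: Sequence[str]) -> List[float]:
    """Concatenate the four feature groups into one vector."""
    return (one_hot_bag(tokens, VOCAB)
            + one_hot_bag(ne_types, NE_TYPES)
            + mean_embedding(core_predicates)
            + cluster_bag(core_predicates))


if __name__ == "__main__":
    print(build_features(["battery", "fire"], ["PRODUCT"], ["explode", "recall"]))
```

In a sketch like this, the resulting vector could be passed to any standard classifier; the abstract does not name the one the authors used.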
