A Transfer Learning Method for Solving Imbalance Data of Abusive Sentence Classification

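  • Suin Seo (Department of Computer Science, Yonsei University);
  • Sung-Bae Cho (Department of Computer Science, Yonsei University)
  • Received: 2017.08.08
  • Reviewed: 2017.10.11
  • Published: 2017.12.15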

Abstract

Classifying abusive sentences with a supervised learning approach requires training sentences that have already been labeled as abusive or non-abusive. A character-level convolutional neural network (CNN) is well suited to this task because it is robust to variations in individual characters, but it has the drawback of needing a large amount of training data. To address this, we propose a transfer learning method. Randomly generated abusive/non-abusive sentence pairs are first used to train a CNN-based classifier, so that the convolutional filters are tuned to capture the features of abusive words. The filters are then reused when the classifier is trained on the real training sentences. This is expected to reduce the effects of data shortage and class imbalance and thereby improve classification performance. Experiments and evaluations were carried out on three datasets, and the character-level CNN classifier achieved a higher F1 score on all of them when transfer learning was applied.
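The two-stage procedure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the pair generator (random text with and without an inserted lexicon token), the placeholder lexicon `LEXICON`, the filter width, the number of filters, and the tiny "real" dataset are all assumptions made for illustration. The sketch only shows the mechanics of pretraining character-level convolution filters on generated pairs and then reusing them, with a fresh classification head, when training on real sentences.

```python
import copy
import math
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
WIDTH = 3        # filter width in characters (assumed)
N_FILTERS = 4    # number of convolution filters (assumed)
LEXICON = ["zzq", "xkk"]  # hypothetical placeholder tokens standing in for abusive words


def make_pair(rng, lexicon, length=12):
    """Return (normal, abusive): random text, and the same text with a lexicon token inserted."""
    normal = "".join(rng.choice(ALPHABET) for _ in range(length))
    cut = rng.randrange(length + 1)
    return normal, normal[:cut] + rng.choice(lexicon) + normal[cut:]


def new_model(rng):
    """Filters are indexed as filters[k][offset][char]; the head is a logistic layer."""
    filters = [[[rng.uniform(-0.1, 0.1) for _ in ALPHABET] for _ in range(WIDTH)]
               for _ in range(N_FILTERS)]
    return {"filters": filters, "head": [0.0] * N_FILTERS, "bias": 0.0}


def forward(model, text):
    """Convolve over one-hot characters, max-pool over time, apply a sigmoid head."""
    idx = [CHAR_INDEX[ch] for ch in text]
    pooled, argmax = [], []
    for filt in model["filters"]:
        scores = [sum(filt[j][idx[i + j]] for j in range(WIDTH))
                  for i in range(len(idx) - WIDTH + 1)]
        best = max(range(len(scores)), key=scores.__getitem__)
        pooled.append(scores[best])
        argmax.append(best)
    z = model["bias"] + sum(v * m for v, m in zip(model["head"], pooled))
    z = max(-30.0, min(30.0, z))  # clamp only to keep exp() in range
    return 1.0 / (1.0 + math.exp(-z)), pooled, argmax, idx


def train(model, data, epochs=20, lr=0.1):
    """Plain SGD on binary cross-entropy; gradients flow through the max-pooled window."""
    for _ in range(epochs):
        for text, label in data:
            p, pooled, argmax, idx = forward(model, text)
            err = p - label  # gradient of the loss with respect to the logit
            for k in range(N_FILTERS):
                grad_pooled = err * model["head"][k]
                model["head"][k] -= lr * err * pooled[k]
                for j in range(WIDTH):
                    model["filters"][k][j][idx[argmax[k] + j]] -= lr * grad_pooled
            model["bias"] -= lr * err
    return model


def transfer(pretrained):
    """Copy the pretrained filters and start from a fresh classification head."""
    return {"filters": copy.deepcopy(pretrained["filters"]),
            "head": [0.0] * N_FILTERS, "bias": 0.0}


rng = random.Random(0)

# Stage 1: pretrain the filters on generated abusive/non-abusive pairs.
synthetic = []
for _ in range(200):
    normal, abusive = make_pair(rng, LEXICON)
    synthetic += [(normal, 0), (abusive, 1)]
pretrained = train(new_model(rng), synthetic)

# Stage 2: reuse the filters when training on a small, imbalanced real set (toy data).
real = [("have a nice day", 0), ("see you later", 0), ("thanks a lot", 0), ("you zzq", 1)]
classifier = train(transfer(pretrained), real)
```

Because the generated pairs are balanced by construction, the filters see as many abusive as non-abusive examples during pretraining, even when the real data contain few abusive sentences. In this sketch the filters continue to be updated in the second stage; freezing them instead would be an equally plausible reading of "reusing the filters".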

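Project Information

Research project: Research and development of emotional intelligence technology that infers and judges a counterpart's emotions and converses and responds accordingly

Host organization: Institute for Information & Communications Technology Promotion (IITP)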
