
Sentiment Analysis using Parallel Tri-LSTM Sentence Embedding Robust to Out-of-Vocabulary Words


  • Hyun-Young Lee (Graduate School, Dept. of Computer Science, Kookmin University)
  • Seung-Shik Kang (Dept. of Computer Science, Kookmin University)
  • Received : 2020.12.18
  • Accepted : 2021.02.19
  • Published : 2021.03.31

Abstract

Existing word embedding methods such as word2vec represent only the words that occur in the raw training corpus as fixed-length vectors in a continuous vector space. In morphologically rich languages, therefore, the out-of-vocabulary (OOV) problem arises frequently whenever a vector must be produced for a word that did not appear in the training corpus. Sentence embedding, which composes the word vectors of a sentence into a fixed-length sentence vector, suffers from the same problem: OOV words make it difficult to represent the meaning of a sentence precisely. In particular, because Korean is an agglutinative language in which lexical morphemes and grammatical morphemes are combined, handling OOV words is an important factor in improving performance. In this paper, we propose a parallel Tri-LSTM sentence embedding that is robust to the OOV problem by extending the use of morphological information from the word level to the sentence level. In a sentiment analysis task on a Korean corpus, we found empirically that the character is a better embedding unit than the morpheme for Korean sentence embedding, and the parallel bidirectional Tri-LSTM sentence encoder achieved 86.17% accuracy.
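The abstract does not spell out the encoder's internals, but the core idea it builds on, composing a sentence vector from character-level subword units so that unseen words still receive a representation, can be illustrated with a minimal sketch. The snippet below is a toy, fastText-style illustration, not the paper's Tri-LSTM model: every name, dimension, and hashing choice here is an assumption for demonstration only.

```python
import hashlib
import random

DIM = 8          # toy embedding dimension (assumption, not from the paper)
BUCKETS = 1000   # hash buckets for trigram embeddings (assumption)

def char_trigrams(word):
    """Character trigrams of a word, padded with boundary markers < and >."""
    padded = f"<{word}>"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def trigram_vector(trigram):
    """Deterministic pseudo-embedding: hash the trigram to seed a toy vector.
    A real model would look up a trained embedding table instead."""
    seed = int(hashlib.md5(trigram.encode("utf-8")).hexdigest(), 16) % BUCKETS
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def sentence_vector(sentence):
    """Average the trigram vectors of all words in the sentence.
    Because every word decomposes into character trigrams, an OOV word
    still yields a vector instead of failing a vocabulary lookup."""
    vecs = [trigram_vector(t)
            for word in sentence.split()
            for t in char_trigrams(word)]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

# A sentence containing words never seen at "training" time still embeds:
vec = sentence_vector("영화 정말 재미있다")
print(len(vec))  # a DIM-dimensional sentence vector
```

In the paper's actual model, such character-unit representations feed a parallel bidirectional Tri-LSTM encoder rather than a simple average; the sketch only shows why the character unit sidesteps the OOV lookup failure.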


