http://dx.doi.org/10.30693/SMJ.2021.10.1.16

Sentiment Analysis using Robust Parallel Tri-LSTM Sentence Embedding in Out-of-Vocabulary Word  

Lee, Hyun Young (Graduate School, Department of Computer Engineering, Kookmin University)
Kang, Seung Shik (Department of Computer Engineering, Kookmin University)
Publication Information
Smart Media Journal / v.10, no.1, 2021, pp. 16-24
Abstract
Existing word embedding methods such as word2vec represent only the words that occur in the raw training corpus as fixed-length vectors in a continuous vector space, so the out-of-vocabulary (OOV) problem often arises when words in a morphologically rich language are mapped to fixed-length vectors. The same holds for sentence embedding: when the meaning of a sentence is represented as a fixed-length vector by composing the vectors of its constituent words, OOV words make it difficult to represent the sentence meaningfully. In particular, because Korean is an agglutinative language whose words combine lexical and grammatical morphemes, handling OOV words is an important factor in improving performance. In this paper, we propose a parallel Tri-LSTM sentence embedding that is robust to the OOV problem by extending the use of word-level morphological information to the sentence level. In a sentiment analysis task on a Korean corpus, we found empirically that the character is a better embedding unit than the morpheme for Korean sentence embedding. We achieved 86.17% accuracy on the sentiment analysis task with the parallel bidirectional Tri-LSTM sentence encoder.
Keywords
Sentence embedding; Morpheme embedding; Character embedding; LSTM encoder; OOV word
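
The OOV problem described in the abstract can be illustrated with a small sketch. This is not the paper's Tri-LSTM encoder. It only compares how often a held-out sentence contains unseen tokens when the vocabulary is built from whitespace-separated words and when it is built from characters (here, precomposed Hangul syllables). The toy sentences and the helper names `build_vocab`, `oov_rate`, `word_tokens`, and `char_tokens` are hypothetical and chosen for illustration.

```python
# Toy illustration of the OOV problem for an agglutinative language.
# Assumption: the sentences below are invented examples, not data from the paper.

def word_tokens(sentence):
    """Split on whitespace, so each inflected form is a distinct token."""
    return sentence.split()

def char_tokens(sentence):
    """Use each non-space character (a Hangul syllable block) as a token."""
    return [ch for ch in sentence if not ch.isspace()]

def build_vocab(sentences, tokenize):
    """Collect every token seen in the training sentences."""
    return {tok for s in sentences for tok in tokenize(s)}

def oov_rate(sentence, vocab, tokenize):
    """Fraction of tokens in the sentence that are missing from the vocabulary."""
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    unseen = sum(1 for tok in tokens if tok not in vocab)
    return unseen / len(tokens)

train = ["영화가 재미있다", "배우는 멋있다"]
test_sentence = "영화는 멋있다"  # same stem as in training, different particle

word_vocab = build_vocab(train, word_tokens)
char_vocab = build_vocab(train, char_tokens)

word_oov = oov_rate(test_sentence, word_vocab, word_tokens)
char_oov = oov_rate(test_sentence, char_vocab, char_tokens)

print(f"word-level OOV rate: {word_oov:.2f}")
print(f"char-level OOV rate: {char_oov:.2f}")
```

In this toy case the word-level vocabulary misses the form that attaches a different particle to a known stem, while every character of the test sentence was seen in training. That is the intuition behind using smaller embedding units, although a real corpus can still contain unseen characters.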