The Sentence Similarity Measure Using Deep-Learning and Char2Vec

Lim, Geun-Young; Cho, Young-Bok

doi:10.6109/jkiice.2018.22.10.1300

Journal of the Korea Institute of Information and Communication Engineering (한국정보통신학회논문지)

Volume 22, Issue 10 / Pages 1300-1306 / 2018 / 2234-4772 (pISSN) / 2288-4165 (eISSN)

The Korea Institute of Information and Communication Engineering (한국정보통신학회)

딥러닝과 Char2Vec을 이용한 문장 유사도 판별

Lim, Geun-Young (Department of Information Security, Daejeon University) ;
Cho, Young-Bok (Department of Information Security, Daejeon University)

임근영 ;
조영복

Received : 2018.06.28
Accepted : 2018.07.19
Published : 2018.10.31

https://doi.org/10.6109/jkiice.2018.22.10.1300

Abstract

This study examines whether Char2Vec can serve as an alternative to Word2Vec, the best-known word embedding model, for the sentence similarity measurement problem in deep learning. In the experiment, we used the Siamese Ma-LSTM recurrent neural network architecture to measure the similarity of two arbitrary sentences. The Siamese Ma-LSTM model was implemented with TensorFlow. We trained each model for 200 epochs in a GPU environment, which took about 20 hours, and then compared the training results of the Word2Vec-based model with those of the Char2Vec-based model. The Char2Vec-based model, initialized with random weights, reached 75.1% accuracy on the validation dataset, while the Word2Vec-based model, pretrained on 3 million words and phrases, reached 71.6%. Char2Vec is therefore a suitable alternative to Word2Vec for easing the high system memory requirement.
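The abstract does not spell out how the Siamese network turns two sentence encodings into a similarity score. The following is a minimal sketch, assuming that "Ma-LSTM" refers to the Manhattan LSTM of Mueller and Thyagarajan (2016), where the score is exp(-||h_a - h_b||_1) computed on the final hidden states of the two weight-sharing LSTM branches. The function name and the example vectors are hypothetical, and the LSTM encoder itself is omitted.

```python
import math


def manhattan_similarity(h_a, h_b):
    """Similarity in (0, 1] between two sentence encodings.

    h_a and h_b stand in for the final hidden states produced by the two
    branches of a Siamese network (assumption: the branches share weights,
    so both vectors have the same length).
    """
    if len(h_a) != len(h_b):
        raise ValueError("encodings must have the same length")
    # L1 (Manhattan) distance between the two encodings.
    l1_distance = sum(abs(a - b) for a, b in zip(h_a, h_b))
    # exp(-distance) maps distance 0 to similarity 1 and large distances toward 0.
    return math.exp(-l1_distance)


# Hypothetical encodings, for illustration only.
identical = manhattan_similarity([0.1, -0.3, 0.5], [0.1, -0.3, 0.5])
different = manhattan_similarity([0.0, 0.0], [1.0, 1.0])
```

Identical encodings give a score of exactly 1, and the score decays toward 0 as the encodings move apart, so it can be compared directly against a 0-to-1 similarity label.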

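This study addresses sentence similarity determination, one of the problems in natural language processing, with deep learning. Sentences are preprocessed and trained on the basis of Char2Vec in order to check the resulting performance and to assess whether Char2Vec can replace Word2Vec, the representative word embedding model. The Siamese Ma-LSTM network was used as the deep learning architecture for comparing two arbitrary sentences. Sentence similarity models based on Word2Vec and on Char2Vec were each trained, and the results were analyzed. In the experiment, the model trained on Char2Vec showed a validation accuracy of 75.1%, and the model trained on Word2Vec showed a validation accuracy of 71.6%. The analysis environment can therefore be optimized by using a Char2Vec-based preprocessing model with an embedding layer in place of Word2Vec, which demands high-end hardware.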
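The memory argument in the abstract can be made concrete with a small sketch. It encodes a sentence as character ids for a trainable embedding layer and compares the size of a character embedding table with that of a 3-million-entry word table. The helper names, the padding and unknown-character ids, the 300-dimensional vectors, and the 4-byte float size are all assumptions for illustration; the source states only the 3 million words and phrases.

```python
PAD_ID = 0  # assumed id for padding positions
UNK_ID = 1  # assumed id for characters not seen when building the vocabulary


def build_char_vocab(sentences):
    """Assign an integer id (starting at 2) to every distinct lowercase character."""
    chars = sorted({ch for sentence in sentences for ch in sentence.lower()})
    return {ch: index + 2 for index, ch in enumerate(chars)}


def encode_chars(sentence, vocab, max_len):
    """Turn a sentence into a fixed-length list of character ids."""
    ids = [vocab.get(ch, UNK_ID) for ch in sentence.lower()][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


def embedding_bytes(vocab_size, dim, bytes_per_value=4):
    """Memory needed for a dense embedding table of vocab_size x dim values."""
    return vocab_size * dim * bytes_per_value


sentences = ["How old are you?", "What is your age?"]
vocab = build_char_vocab(sentences)
encoded = encode_chars("How old", vocab, max_len=10)

# Assumed dimension of 300 for both tables; only the vocabulary size differs.
word_table = embedding_bytes(3_000_000, 300)      # 3 million words and phrases
char_table = embedding_bytes(len(vocab) + 2, 300)  # characters plus PAD and UNK
```

Under these assumptions the word table needs about 3.6 GB, while the character table stays in the kilobyte range, which illustrates why a randomly initialized character embedding layer is much lighter to load and train.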

Keywords

Deep learning

Cited by

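A Study on Improving the Speech Recognition Rate (H/W, S/W) for People with Speech Disorders Caused by Neurological Injury, vol.23, no.11, 2019, https://doi.org/10.6109/jkiice.2019.23.11.1397
Weighted Least Squares Based on Feature Transformation Using Distance Computation for Binary Classification, vol.24, no.2, 2020, https://doi.org/10.6109/jkiice.2020.24.2.219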