
Study on Difference of Wordvectors Analysis Induced by Text Preprocessing for Deep Learning

  • Ko Kwang-Ho (Dept. of Smart Automotive Engineering, Pyeongtaek University)
  • Received: 2022.07.06
  • Reviewed: 2022.08.31
  • Published: 2022.09.30

Abstract

The results produced by LSTM, a deep-learning technique for building language models, vary with how the training corpus is preprocessed. In this study, an LSTM model was trained on a well-known literary work, the poems of Ki Hyung-do, as its corpus. Two different sets of word vectors are obtained depending on whether the original text is used as-is or Korean particles and word endings are removed. For these two preprocessing schemes, we compared the results of similarity and analogy operations, the positions of the word vectors projected onto a 2D plane, and the texts generated by the resulting language models. When a literary work is used as the corpus, the words returned by these operations change with the preprocessing scheme, yet the suggested words remain highly similar and their analogy relations remain strongly correlated. The 2D positions of the word vectors also shift, but they stay consistent with the original context, and the generated texts can be appreciated as novel works that nevertheless retain the mood of the original. These analyses suggest that deep-learning language models can serve as a tool for enjoying literary works in an objective and varied way.
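The similarity and analogy operations compared in the study can be sketched as plain cosine arithmetic over word vectors. The following is a minimal, self-contained illustration only: the words and 2D vectors below are hypothetical toy data, not the embeddings from the paper, which are learned by an LSTM language model trained on the poems.

```python
import numpy as np

# Hypothetical toy word vectors standing in for learned embeddings.
vecs = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, 0.0]),
    "man":   np.array([0.5, 1.0]),
    "woman": np.array([0.5, 0.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word):
    """Similarity operation: the word whose vector is closest to the query."""
    return max((w for w in vecs if w != word),
               key=lambda w: cosine(vecs[word], vecs[w]))

def analogy(a, b, c):
    """Analogy operation: solve a : b = c : ? via the offset a - b + c."""
    target = vecs[a] - vecs[b] + vecs[c]
    return max((w for w in vecs if w not in {a, b, c}),
               key=lambda w: cosine(target, vecs[w]))

print(most_similar("king"))             # -> "man" with these toy vectors
print(analogy("king", "man", "woman"))  # -> "queen" with these toy vectors
```

In the study, the same two operations are run on the vector sets from each preprocessing scheme and the returned words are compared; only the source of the vectors differs.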
