http://dx.doi.org/10.30693/SMJ.2022.11.5.17

Korean Part-Of-Speech Tagging by using Head-Tail Tokenization  

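Suh, Hyun-Jae (Department of Computer Engineering, Kookmin University)
Kim, Jung-Min (Department of Computer Engineering, Kookmin University)
Kang, Seung-Shik (Department of Computer Engineering, Kookmin University)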
Publication Information
Smart Media Journal / v.11, no.5, 2022, pp. 17-25
Abstract
Korean part-of-speech taggers decompose compound morphemes into unit morphemes and attach a part-of-speech tag to each. A disadvantage of this approach is that parts of speech are classified in excessive detail and that, depending on the purpose of the tagger, complex word types are generated. When a part-of-speech tagger is used for keyword extraction in deep learning-based language processing, compound particles and verb endings do not need to be decomposed. In this study, the part-of-speech tagging problem is simplified with a Head-Tail tokenization technique that produces only two types of tokens, a lexical morpheme part (head) and a grammatical morpheme part (tail), which avoids excessive morpheme decomposition. Two taggers were trained on the Head-Tail tokenized corpus and their accuracy was evaluated: the TnT tagger, a statistical part-of-speech tagger, and a Bi-LSTM tagger, a deep learning-based part-of-speech tagger. The Bi-LSTM tagger achieved a part-of-speech tagging accuracy of 99.52%, compared with 97.00% for the TnT tagger.
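To illustrate the idea of Head-Tail tokenization, the following is a minimal sketch that splits each space-delimited word into a lexical part and a grammatical part by longest-suffix matching. It is not the method used in the study: the ending list `TAILS` and the functions `head_tail_split` and `tokenize` are hypothetical, and a real tokenizer would derive its splits from a large corpus rather than from a hand-written list.

```python
# Hypothetical toy inventory of grammatical endings (particles and verb endings).
# This list is an assumption for illustration only, not the study's resource.
TAILS = ["에서는", "에서", "었다", "는", "은", "을", "를", "이", "가", "다"]


def head_tail_split(word, tails=TAILS):
    """Split one word into (head, tail) using the longest matching ending.

    The tail is kept as a single unit, so a compound ending such as
    "에서는" is not decomposed further.
    """
    for tail in sorted(tails, key=len, reverse=True):
        # Require a non-empty head so a word is never consumed entirely.
        if word.endswith(tail) and len(word) > len(tail):
            return word[: -len(tail)], tail
    # No known ending matched: the whole word is treated as the head.
    return word, ""


def tokenize(sentence):
    """Apply the Head-Tail split to every space-delimited word."""
    return [head_tail_split(word) for word in sentence.split()]


if __name__ == "__main__":
    for head, tail in tokenize("학교에서는 책을 읽었다"):
        print(head, tail)
```

In this sketch each word yields at most two tokens, so a tagger trained on such output needs to assign only one tag to the head and one to the tail instead of tagging every unit morpheme.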
Keywords
Part-Of-Speech Tagging; Head-Tail Tokenization; TnT Tagger; Bi-LSTM Tagger;