http://dx.doi.org/10.14369/jkmc.2019.32.3.047

Comparison of Word Extraction Methods Based on Unsupervised Learning for Analyzing East Asian Traditional Medicine Texts  

Oh, Junho (Korea Institute of Oriental Medicine)
Publication Information
Journal of Korean Medical Classics / v.32, no.3, 2019, pp. 47-57
Abstract
Objectives: We aim to assist in choosing an appropriate word extraction method for analyzing East Asian traditional medicine texts based on unsupervised learning. Methods: To rank substrings, we tested one method for the exterior boundary value (BE: branching entropy), three methods for the interior boundary value (CS: cohesion score, TS: t-score, SL: simple-ll), and six methods that combine them (BExSL, BExTS, BExCS, CSxTS, CSxSL, TSxSL). Results: With miss rate (MR) as the criterion, the error was smallest when TS and SL were used together and largest when CS was used alone. When the number of segmented texts was applied as a weight, SL gave the best results and BE alone gave the worst. Conclusions: Unsupervised-learning-based word extraction can be used to analyze texts without a prepared vocabulary. When using this approach, SL or the combination of SL and TS should be considered first.
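To make the exterior/interior distinction concrete, the following is a minimal sketch of two of the scores named above: right branching entropy (an exterior boundary measure) and cohesion score (an interior boundary measure). The formulas follow their commonly cited definitions and are assumptions here, since the abstract does not state them; the function names and the four-line toy corpus are hypothetical and are not data from the study. The t-score and simple-ll measures, and the combined methods, are not covered.

```python
import math
from collections import Counter


def substring_counts(lines, max_len=4):
    """Count every substring of length 1..max_len in each line."""
    counts = Counter()
    for line in lines:
        for i in range(len(line)):
            for j in range(i + 1, min(i + max_len, len(line)) + 1):
                counts[line[i:j]] += 1
    return counts


def right_branching_entropy(word, lines):
    """Entropy (in bits) of the character that follows `word`.

    High entropy means many different characters can follow,
    which suggests a word boundary right after `word`.
    """
    followers = Counter()
    for line in lines:
        start = line.find(word)
        while start != -1:
            end = start + len(word)
            if end < len(line):
                followers[line[end]] += 1
            start = line.find(word, start + 1)
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in followers.values())


def cohesion_score(word, counts):
    """Geometric mean of the forward transition probabilities inside `word`.

    Assumed definition: (count(word) / count(first char)) ** (1 / (len - 1)).
    """
    if len(word) < 2 or counts[word] == 0:
        return 0.0
    return (counts[word] / counts[word[0]]) ** (1 / (len(word) - 1))


# Hypothetical toy corpus: unsegmented strings of herb names.
lines = ["人參甘草", "人參白朮", "人參茯苓", "甘草人參"]
counts = substring_counts(lines)

for candidate in ["人參", "參甘"]:
    print(candidate,
          round(cohesion_score(candidate, counts), 3),
          round(right_branching_entropy(candidate, lines), 3))
```

In this toy corpus, 人參 always appears as a unit and is followed by varied characters, so it scores high on both measures, whereas 參甘 straddles two words and scores low on cohesion.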
Keywords
Text segmentation; Word extraction; Tokenization; East Asian Traditional Medicine; Korean medicine