A Semi-Automatic Semantic Mark Tagging System for Building Dialogue Corpus

Korean title: 대화 말뭉치 구축을 위한 반자동 의미표지 태깅 시스템

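  • Junhyeok Park (Department of Computer and Information Engineering, Korea National University of Transportation) ;
  • Songwook Lee (Major in Computer and Information Engineering, Korea National University of Transportation) ;
  • Yoonseob Lim (Intelligent Robotics Research Group / Dementia DTC Convergence Research Group, Korea Institute of Science and Technology) ;
  • Jongsuk Choi (HCI and Robotics Major, University of Science and Technology (UST))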
  • Received : 2018.12.28
  • Accepted : 2019.03.21
  • Published : 2019.05.31

Abstract

Determining the meaning of a keyword is an important technology for implementing intelligent speech dialogue interfaces. A speech dialogue system extracts keywords from the user's utterance and uses the semantic mark of each keyword to determine the intention of the utterance. Because one keyword can have several semantic marks, we treat the task of attaching the semantic mark that matches the user's intention as a word sense disambiguation problem. In this study, about 23% of all keywords in the corpus are manually tagged to build a semantic mark dictionary, a synonym dictionary, and a context vector dictionary, and the remaining 77% are then tagged automatically. The semantic mark of an ambiguous keyword is determined by calculating context vector similarity against the context vector dictionary. For an unregistered keyword, the synonym dictionary is used to find the most similar keyword, and that keyword's semantic mark is attached. We compare the performance of the system with the manually constructed training set and with the semi-automatically expanded training set on three high-frequency and three low-frequency keywords selected from the corpus. In the experiments, we obtained an accuracy of 54.4% with the manually constructed training set and 50.0% with the semi-automatically expanded training set.

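In implementing an intelligent spoken dialogue interface, the semantic mark of a keyword is an important element for grasping the user's intention. To identify the intention of a user utterance, a dialogue system uses keywords and their semantic marks. A single keyword is ambiguous in that it can carry several semantic marks. Resolving such an ambiguous keyword to the semantic mark that matches the user's intention is similar to the word sense disambiguation problem. We first manually attach meanings to about 23% of the transcribed dialogue corpus to build a semantic mark dictionary, a synonym dictionary, and a context vector dictionary for keywords, and then automatically attach meanings to the keywords in the remaining 77% of the dialogue corpus. For an ambiguous keyword, the meaning is determined by computing context vector similarity from the context vector dictionary. When a keyword is unregistered, the synonym dictionary is used to find the most similar keyword, and that keyword's meaning is attached. We evaluated the proposed system on three ambiguous high-frequency keywords and three low-frequency keywords selected from the corpus. In the experiments, we obtained an accuracy of about 54.4% with the manually built corpus and about 50.0% with the semi-automatically expanded corpus.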
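
The tagging procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bag-of-words context vector, the cosine similarity measure, the layout of the dictionaries, and every name and data value below (`tag_keyword`, `CONTEXT_VECTORS`, `SYNONYMS`, the English toy keywords and marks) are assumptions made only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy dictionaries; the real ones are built from the manually tagged corpus.
VOCABULARY = ["song", "music", "volume", "game", "chess", "score"]
CONTEXT_VECTORS = {  # keyword -> {semantic mark -> context vector over VOCABULARY}
    "play": {
        "PLAY_MUSIC": [3, 2, 1, 0, 0, 0],
        "PLAY_GAME": [0, 0, 0, 3, 2, 1],
    },
}
SYNONYMS = {"perform": ["play"]}  # keyword -> synonyms, assumed ordered most similar first


def context_vector(context_words, vocabulary):
    """Count how often each vocabulary word occurs in the context (assumed representation)."""
    counts = Counter(context_words)
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tag_keyword(keyword, context_words):
    """Return the semantic mark whose stored context vector best matches the context."""
    if keyword not in CONTEXT_VECTORS:
        # Unregistered keyword: fall back to the most similar registered synonym.
        registered = [s for s in SYNONYMS.get(keyword, []) if s in CONTEXT_VECTORS]
        if not registered:
            return None  # no registered synonym, so the keyword stays untagged
        keyword = registered[0]
    query = context_vector(context_words, VOCABULARY)
    senses = CONTEXT_VECTORS[keyword]
    return max(senses, key=lambda mark: cosine(query, senses[mark]))


print(tag_keyword("play", ["music", "song"]))     # registered, ambiguous keyword
print(tag_keyword("perform", ["chess", "game"]))  # unregistered keyword, synonym fallback
```

In this sketch an unregistered keyword simply inherits the dictionary entry of its first registered synonym, and disambiguation then proceeds as for a registered keyword. Whether the original system disambiguates after the synonym lookup in this way is an assumption.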

Fig. 1. An Example of the Dialogue Corpus

Fig. 2. The Process of Building Dictionaries for the Proposed System

Fig. 3. Examples of an Ambiguous Keyword

Fig. 4. An Example of Building the Context Words Set

Fig. 5. The Process of Determining Semantic Mark Using the Dictionaries

Fig. 6. An Example of Producing a Context Vector

Table 1. Dataset for Word2Vec Modeling

Table 2. The Manually Tagged Set and the Expanded Set

Table 3. An Evaluation Result for New Keywords

Table 4. Number of Cases by Automatic Tagging

Table 5. Entries of the Keyword-Semantic Mark Dictionary

Table 6. Frequency of Semantic Marks and the Result of the System

Table 7. Context Words Frequency of ‘나다(nada)’
