Detecting Errors in POS-Tagged Corpus on XGBoost and Cross Validation

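  • 최민석 (Department of Computer Engineering, Korea Maritime and Ocean University)
  • 김창현 (Electronics and Telecommunications Research Institute)
  • 박호민 (Department of Computer Engineering, Korea Maritime and Ocean University)
  • 천민아 (Department of Computer Engineering, Korea Maritime and Ocean University)
  • 윤호 (Department of Computer Engineering, Korea Maritime and Ocean University)
  • 남궁영 (Department of Computer Engineering, Korea Maritime and Ocean University)
  • 김재균 (Department of Computer Engineering, Korea Maritime and Ocean University)
  • 김재훈 (Department of Computer Engineering, Korea Maritime and Ocean University)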
  • Received : 2020.04.02
  • Accepted : 2020.04.25
  • Published : 2020.07.31

Abstract

A part-of-speech (POS) tagged corpus is a collection of electronic text in which each word is annotated with its POS tag, and it is widely used as training data in natural language processing. Training data is generally assumed to be error-free, but in practice it contains various types of errors, which degrade the performance of systems trained on it. To alleviate this problem, we propose a novel method for detecting errors in an existing POS-tagged corpus using an XGBoost classifier and cross-validation. We first train a POS tagger with XGBoost on the POS-tagged corpus, errors included, and then apply it to the same corpus through cross-validation. Because no corpus annotated with POS errors is available for training, an ordinary classifier cannot be trained to detect the errors directly. We therefore detect errors by comparing the outputs of the trained tagger (the POS probabilities) while adjusting hyperparameters. The hyperparameters are estimated on a small error-tagged corpus: text randomly sampled from the full POS-tagged corpus, in which experts have marked the POS errors. As evaluation metrics we use precision and recall, which are widely used in information retrieval. Since not all error candidates in the population can be checked by hand, we support the validity of the proposed method by comparing the error distributions of the sample (the error-tagged corpus) and the population (the POS-tagged corpus). In future work, we plan to apply the proposed method to dependency-structure-tagged corpora and semantic-role-tagged corpora.
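The detection step summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the per-token POS probabilities have already been obtained as out-of-fold predictions from a cross-validated tagger (for example via `predict_proba` of an `xgboost.XGBClassifier`), and it uses a hypothetical decision rule: a token is flagged when the tagger's top tag differs from the annotated tag by at least a probability margin `threshold`. The threshold is tuned on a small expert-checked sample using precision and recall; selecting by F1 is an assumption, as is the use of KL divergence to compare the sample and population distributions, since the abstract does not name the measure. All function names and the toy data are illustrative.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Probs = Dict[str, float]  # one token: POS tag -> predicted probability


def flag_errors(probs: Sequence[Probs], annotated: Sequence[str],
                threshold: float) -> List[int]:
    """Flag token i when the tagger's top tag differs from the annotated tag
    and beats it by at least `threshold` (an assumed decision rule)."""
    flagged = []
    for i, (p, tag) in enumerate(zip(probs, annotated)):
        best = max(p, key=p.get)
        if best != tag and p[best] - p.get(tag, 0.0) >= threshold:
            flagged.append(i)
    return flagged


def precision_recall(flagged: Sequence[int],
                     true_errors: Set[int]) -> Tuple[float, float]:
    """Precision and recall of flagged token indices against expert-marked errors."""
    hits = len(set(flagged) & true_errors)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_errors) if true_errors else 0.0
    return precision, recall


def tune_threshold(probs: Sequence[Probs], annotated: Sequence[str],
                   true_errors: Set[int],
                   candidates: Sequence[float]) -> Tuple[float, float]:
    """Pick the candidate threshold with the highest F1 on the error-tagged sample."""
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        p, r = precision_recall(flag_errors(probs, annotated, t), true_errors)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


def kl_divergence(p_counts: Dict[str, int], q_counts: Dict[str, int],
                  eps: float = 1e-9) -> float:
    """KL(P || Q) between two tag-count distributions, with additive smoothing."""
    tags = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(tags)
    q_total = sum(q_counts.values()) + eps * len(tags)
    kl = 0.0
    for t in tags:
        p = (p_counts.get(t, 0) + eps) / p_total
        q = (q_counts.get(t, 0) + eps) / q_total
        kl += p * math.log(p / q)
    return kl


# Toy error-tagged sample (hypothetical values, Sejong-style tag names).
sample_probs = [
    {"NNG": 0.90, "VV": 0.10},   # tagger agrees with the annotation
    {"NNG": 0.20, "VV": 0.80},   # strong disagreement
    {"NNG": 0.45, "VV": 0.55},   # weak disagreement
    {"JKS": 0.05, "JKO": 0.95},  # strong disagreement
]
sample_annotated = ["NNG", "NNG", "NNG", "JKS"]
sample_true_errors = {1, 3}      # token indices the experts marked as errors

threshold, f1 = tune_threshold(sample_probs, sample_annotated,
                               sample_true_errors, [0.0, 0.3, 0.7])
print(threshold, f1)
```

The tuned threshold would then be applied to the out-of-fold probabilities of the full corpus, and the distribution of flagged tags in the sample could be compared with that in the population, for instance with `kl_divergence`, as a sanity check when the population cannot be verified by hand.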
