http://dx.doi.org/10.3745/KTSDE.2020.9.7.221

Detecting Errors in POS-Tagged Corpus on XGBoost and Cross Validation  

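Choi, Min-Seok (Department of Computer Engineering, Korea Maritime and Ocean University)
Kim, Chang-Hyun (Electronics and Telecommunications Research Institute)
Park, Ho-Min (Department of Computer Engineering, Korea Maritime and Ocean University)
Cheon, Min-Ah (Department of Computer Engineering, Korea Maritime and Ocean University)
Yoon, Ho (Department of Computer Engineering, Korea Maritime and Ocean University)
Namgoong, Young (Department of Computer Engineering, Korea Maritime and Ocean University)
Kim, Jae-Kyun (Department of Computer Engineering, Korea Maritime and Ocean University)
Kim, Jae-Hoon (Department of Computer Engineering, Korea Maritime and Ocean University)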
Publication Information
KIPS Transactions on Software and Data Engineering, v.9, no.7, 2020, pp. 221-228
Abstract
A part-of-speech (POS) tagged corpus is a collection of electronic text in which each word is annotated with its POS tag, and it is widely used as training data for natural language processing. Training data is generally assumed to be free of errors, but in reality it contains various types of errors, which degrade the performance of systems trained on it. To alleviate this problem, we propose a novel method for detecting errors in an existing POS-tagged corpus using an XGBoost classifier and cross-validation, an evaluation technique. We first train a POS-tagging classifier on the POS-tagged corpus, which itself contains some errors, and then apply it to the same corpus through cross-validation. The classifier cannot detect errors directly, however, because no training data labeled with POS-tagging errors is available. We therefore detect errors by comparing the classifier's outputs (the POS probabilities) while adjusting hyperparameters. The hyperparameters are estimated from a small-scale error-tagged corpus, whose text is sampled from the POS-tagged corpus and in which POS errors have been marked by experts. As evaluation metrics we use recall and precision, which are widely used in information retrieval. Because not all detected errors can be checked, we show that the proposed method is valid by comparing the distributions of the sample (the error-tagged corpus) and the population (the POS-tagged corpus). In the near future, we will apply the proposed method to a dependency tree-tagged corpus and a semantic role-tagged corpus.
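The detection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a per-word tag-frequency model stands in for the XGBoost classifier so that the example runs with the standard library alone, and the decision rule (flag a token when the model prefers a different tag by more than `margin`) is an assumption, since the abstract does not specify how the probabilities are compared. The names `out_of_fold_probs`, `flag_suspects`, `precision_recall`, and `margin` are hypothetical.

```python
from collections import Counter, defaultdict

def out_of_fold_probs(tokens, k=3):
    """tokens: list of (word, tag). Returns one {tag: prob} dict per token,
    predicted by a model that never saw that token's own fold."""
    probs = [None] * len(tokens)
    for fold in range(k):
        # Stand-in classifier (assumption): per-word tag frequencies
        # counted on the other k-1 folds, not an XGBoost model.
        counts = defaultdict(Counter)
        for i, (word, tag) in enumerate(tokens):
            if i % k != fold:
                counts[word][tag] += 1
        for i, (word, _) in enumerate(tokens):
            if i % k == fold:
                total = sum(counts[word].values())
                probs[i] = ({t: c / total for t, c in counts[word].items()}
                            if total else {})
    return probs

def flag_suspects(tokens, probs, margin):
    """Flag token i when the model prefers another tag by more than margin.
    The margin plays the role of a tunable hyperparameter (assumed rule)."""
    flagged = set()
    for i, ((_, tag), p) in enumerate(zip(tokens, probs)):
        if not p:
            continue  # word unseen in the training folds: no evidence
        best = max(p, key=p.get)
        if best != tag and p[best] - p.get(tag, 0.0) > margin:
            flagged.add(i)
    return flagged

def precision_recall(flagged, true_errors):
    """Precision and recall of flagged indices against expert-marked errors."""
    hits = len(flagged & true_errors)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_errors) if true_errors else 0.0
    return precision, recall

# Toy corpus: index 4 carries a deliberately wrong tag ("the" as NN).
tokens = [("the", "DT"), ("run", "VB"), ("the", "DT"), ("run", "VB"),
          ("the", "NN"), ("run", "VB"), ("the", "DT"), ("run", "VB"),
          ("the", "DT")]
probs = out_of_fold_probs(tokens, k=3)
flagged = flag_suspects(tokens, probs, margin=0.5)
precision, recall = precision_recall(flagged, true_errors={4})
print(flagged, precision, recall)
```

In a setting closer to the one described, `margin` would be chosen by sweeping candidate values and scoring each against the small expert-marked sample with `precision_recall`, then applying the selected value to the full corpus.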
Keywords
Error Detection; POS-tagged Corpus; XGBoost; Cross-validation