Detecting Errors in POS-Tagged Corpus on XGBoost and Cross Validation |
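Choi, Min-Seok (Department of Computer Engineering, Korea Maritime and Ocean University); Kim, Chang-Hyun (Electronics and Telecommunications Research Institute); Park, Ho-Min (Department of Computer Engineering, Korea Maritime and Ocean University); Cheon, Min-Ah (Department of Computer Engineering, Korea Maritime and Ocean University); Yoon, Ho (Department of Computer Engineering, Korea Maritime and Ocean University); Namgoong, Young (Department of Computer Engineering, Korea Maritime and Ocean University); Kim, Jae-Kyun (Department of Computer Engineering, Korea Maritime and Ocean University); Kim, Jae-Hoon (Department of Computer Engineering, Korea Maritime and Ocean University) |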
1 | J. Kim and G. Kim, Building a Korean Part-of-speech Tagged Corpus: KAIST Corpus, CS-TR-95-99, 1995 (in Korean). |
2 | M. Lee, H. Jung, W. Sung, and D. Park, "Verification of POS Tagged Corpus," in Proceedings of the 31st Annual Conference on Human and Cognitive Language Technology, pp.145-150, 2005 (in Korean). |
3 | M. Choi, H. Seo, H. Kwon, and J. Kim, "Detecting and Correcting Errors in Korean POS-tagged Corpora," Journal of the Korean Society of Marine Engineering, Vol.37, No.1, pp.227-235, 2013 (in Korean). |
4 | E. Eskin, "Detecting Errors Within a Corpus using Anomaly Detection," in Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pp.148-153, 2000. |
5 | Q. Ma, B. Lu, M. Murata, M. Ichikawa, and H. Isahara, "On-line Error Detection of Annotated Corpus using Modular Neural Networks," Lecture Notes in Computer Science, Vol.2130, pp.1185-1195, 2001. |
6 | T. Nakagawa and Y. Matsumoto, "Detecting Errors in Corpora using Support Vector Machines," in Proceedings of the 19th International Conference on Computational Linguistics, pp.1-7, 2002. |
7 | M. Dickinson, "Detection of Annotation Errors in Corpora," Language and Linguistics Compass, Vol.9, No.3, pp.119-138, 2015. |
8 | V. Chandola, A. Banerjee, and V. Kumar, "Anomaly Detection: A Survey," ACM Computing Surveys, Vol.41, No.3, Article 15, 2009. |
9 | S. Byers and A. E. Raftery, "Nearest-neighbor Clutter Removal for Estimating Features in Spatial Point Processes," Journal of the American Statistical Association, Vol.93, No.442, pp.572-584, 1998. |
10 | A. Agovic, A. Banerjee, A. R. Ganguly, and V. Protopescu, "Anomaly Detection in Transportation Corridors using Manifold Embedding," in Proceedings of the 1st International Workshop on Knowledge Discovery from Sensor Data, pp.435-455, 2007. |
11 | D. Yu, G. Sheikholeslami, and A. Zhang, "FindOut: Finding Outliers in Very Large Datasets," Knowledge and Information Systems, Vol.4, No.4, pp.387-412, 2002. |
12 | I. Rehbein, "POS Error Detection in Automatically Annotated Corpora," in Proceedings of the 8th Linguistic Annotation Workshop, pp.20-28, 2014. |
13 | T. Chen and C. Guestrin, "XGBoost: A Scalable Tree Boosting System," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Vol.16, pp.785-794, 2016. |
14 | T. G. Dietterich, "Ensemble Methods in Machine Learning," in Proceedings of Multiple Classifier Systems (MCS 2000), Lecture Notes in Computer Science, Vol.1857, 2000. |
15 | L. Breiman, "Random Forests," Machine Learning, Vol.45, pp.5-32, 2001. |
16 | J.-H. Kim, H.-W. Seo, G.-H. Jeon, and M.-G. Choi, "Error Correction Methods for Sejong Corpus," in Proceedings of the Joint Conference on Marine Engineering and Navigation and Port Research, pp.435-436, 2010 (in Korean). |
17 | N. Kang, E. M. van Mulligen, and J. A. Kors, "Training Text Chunkers on a Silver Standard Corpus: Can Silver Replace Gold?," BMC Bioinformatics, Vol.13, No.1, pp.17-22, 2012. |
18 | Sejong Corpus, 21st Century Sejong Project, The National Institute of the Korean Language, 2010 (in Korean). |
19 | P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching Word Vectors with Subword Information," Transactions of the Association for Computational Linguistics, Vol.5, pp.135-146, 2017. |
20 | M. Cheon, C. Kim, J. Kim, E. Noh, K. Sung, and M. Song, "Automated Scoring System for Korean Short-answer Question using Predictability and Unanimity," KIPS Transactions on Software and Data Engineering, Vol.5, No.11, pp.527-534, 2016. |
21 | J. Hong and J. Cha, "Error Correction of Sejong Morphological Annotation Corpora using Part-of-speech Tagger and Frequency Information," Journal of KISS: Software and Applications, Vol.40, No.7, pp.417-428, 2013. |
22 | M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, "Building a Large Annotated Corpus of English: The Penn Treebank," Computational Linguistics, Vol.19, No.2, pp.313-330, 1993. |
23 | S. Kullback, Information Theory and Statistics, Dover Publications, 1968. |