http://dx.doi.org/10.5916/jkosme.2013.37.2.227

Detecting and correcting errors in Korean POS-tagged corpora  

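Choi, Myung-Gil (Kumho Marine Tech)
Seo, Hyung-Won (Department of Computer Engineering, Korea Maritime University)
Kwon, Hong-Seok (Department of Computer Engineering, Korea Maritime University)
Kim, Jae-Hoon (Division of IT Engineering, Korea Maritime University)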
Abstract
The quality of part-of-speech (POS) annotation in a corpus plays an important role in developing POS taggers. However, Korean POS-tagged corpora such as the Sejong Corpus contain several kinds of errors, including annotation errors, spelling errors, and insertions or deletions of unexpected characters. In this paper, we propose a method for detecting annotation errors using error patterns and develop a tool for correcting them effectively. Based on the proposed method, we hand-corrected annotation errors in the Sejong POS-tagged corpus using the developed tool. As a result, correction was at least 9 times faster than correcting without any tool. We therefore conclude that the proposed method is effective for correcting annotation errors in POS-tagged corpora.
Keywords
POS-tagged corpus; Error correction; Error detection; Corpus annotation/correction tool