http://dx.doi.org/10.3745/KIPSTB.2004.11B.3.387

Construction of Linearly Aligned Corpus Using Unsupervised Learning

Lee, Kong-Joo (School of Computer Information Technology, Kyungin Women's College)
Kim, Jae-Hoon (Department of Computer Engineering, Korea Maritime University)
Abstract
In this paper, we propose a modified unsupervised linear alignment algorithm for building an aligned corpus. The original algorithm inserts null characters into both aligned strings (the source string and the target string) because the two strings differ in length. This causes difficulties for applications that use the aligned corpus: the null characters can make the search space explode, and they rule out several machine learning algorithms. To alleviate these difficulties, we modify the algorithm so that the aligned source strings contain no null characters. We demonstrate the usability of our approach by applying it to different areas such as Korean-English back-transliteration, English grapheme-phoneme conversion, and Korean morphological analysis.
Keywords
Unsupervised Learning; Edit Distance; Linear Alignment; Korean-English (Back) Transliteration; Grapheme-Phoneme Conversion; Korean Morphological Segmentation
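The idea of keeping null characters out of the aligned source string can be illustrated with a small dynamic-programming sketch. This is not the authors' algorithm: the function name `align_without_source_nulls` and the fixed costs in `chunk_cost` are assumptions made for illustration, whereas the paper learns its alignment costs in an unsupervised way. The sketch assumes that every source character is paired with a target chunk of zero or more characters, so any null appears only on the target side.

```python
def align_without_source_nulls(source, target):
    """Pair every source character with a (possibly empty) target chunk.

    Illustrative sketch only: costs are fixed by hand, not learned.
    """
    if not source:
        raise ValueError("source must be non-empty")
    n, m = len(source), len(target)
    inf = float("inf")

    def chunk_cost(ch, chunk):
        # Hypothetical costs (assumption, not taken from the paper).
        if chunk == "":
            return 1  # source character maps to a null on the target side
        extra = len(chunk) - 1  # penalty for each additional merged character
        return extra + (0 if ch in chunk else 1)

    # cost[i][j]: cheapest alignment of source[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(1, n + 1):
        for j in range(m + 1):
            for k in range(j + 1):  # target[k:j] is the chunk for source[i-1]
                if cost[i - 1][k] == inf:
                    continue
                c = cost[i - 1][k] + chunk_cost(source[i - 1], target[k:j])
                if c < cost[i][j]:
                    cost[i][j] = c
                    back[i][j] = k

    # Trace back from the full strings to recover the chunk boundaries.
    pairs = []
    j = m
    for i in range(n, 0, -1):
        k = back[i][j]
        pairs.append((source[i - 1], target[k:j]))
        j = k
    pairs.reverse()
    return pairs
```

Because the result always has exactly one pair per source character, it can be used directly as per-character training data, for example with each target chunk treated as the label of its source character.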