Construction of Linearly Aligned Corpus Using Unsupervised Learning

Lee, Kong-Joo;Kim, Jae-Hoon;

doi:10.3745/KIPSTB.2004.11B.3.387

The KIPS Transactions:PartB (정보처리학회논문지B)

Volume 11B, Issue 3 / Pages 387-394 / 2004 / 1598-284X (pISSN)

Korea Information Processing Society (한국정보처리학회)

자율 학습을 이용한 선형 정렬 말뭉치 구축

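Lee, Kong-Joo (Division of Computer Information Technology, Kyungin Women's College);
Kim, Jae-Hoon (Department of Computer Engineering, Korea Maritime University)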

Published : 2004.06.01

https://doi.org/10.3745/KIPSTB.2004.11B.3.387

Abstract

In this paper, we propose a modified unsupervised linear alignment algorithm for building an aligned corpus. The original algorithm inserts null characters into both aligned strings (the source string and the target string) because the two strings differ in length. This causes difficulties for applications that use the aligned corpus: the null characters make the search space explode, and the corpus cannot be used with several machine learning algorithms. To alleviate these difficulties, we modify the algorithm so that the aligned source strings contain no null characters. We show the usability of our approach by applying it to several areas: Korean-English back-transliteration, English grapheme-phoneme conversion, and Korean morphological analysis.

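This paper proposes a method for building a linearly aligned corpus using an unsupervised linear alignment algorithm. When such a corpus is built with the existing unsupervised linear alignment algorithm, null characters appear in both aligned strings (the input string and the output string) whenever the two strings differ in length. Used as-is, this method makes the aligned corpus easy to build, but the search space grows exponentially in application systems that use the corpus, and the resulting corpus cannot be used across a variety of machine learning methods. To minimize these problems, we modified the existing unsupervised linear alignment algorithm so that null characters do not appear in the input string. Using this algorithm, we built aligned corpora for Korean-English transliteration and back-transliteration, pronunciation generation for English words, word generation from English pronunciations, and Korean morpheme segmentation and restoration, and we demonstrated their practicality through simple experiments.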
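The constraint described in the abstract (no null characters on the source side) can be illustrated with a small dynamic-programming sketch. This is not the authors' algorithm: the paper learns its alignment parameters without supervision, whereas the hypothetical `chunk_cost` function below uses fixed, hand-set costs purely for illustration. The names `NULL`, `chunk_cost`, `align_without_source_nulls`, and the `max_chunk` parameter are all assumptions introduced here. Each source character is paired with a target chunk of zero or more characters, so a null placeholder can appear only on the target side.

```python
NULL = "_"  # assumed placeholder for an empty target chunk


def chunk_cost(src_char, chunk):
    """Toy hand-set cost (an assumption; the paper learns its parameters)."""
    if chunk == "":
        return 1.0  # the source character maps to nothing on the target side
    mismatch = 0.0 if src_char in chunk else 1.0
    return mismatch + 0.5 * (len(chunk) - 1)  # mild penalty for long chunks


def align_without_source_nulls(source, target, max_chunk=3):
    """Return one (source char, target chunk) pair per source character."""
    n, m = len(source), len(target)
    inf = float("inf")
    # best[i][j]: minimal cost of aligning source[:i] with target[:j]
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    # back[i][j]: length of the target chunk paired with source[i - 1]
    back = [[0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(m + 1):
            for k in range(min(max_chunk, j) + 1):
                prev = best[i - 1][j - k]
                if prev == inf:
                    continue
                cost = prev + chunk_cost(source[i - 1], target[j - k:j])
                if cost < best[i][j]:
                    best[i][j] = cost
                    back[i][j] = k
    if best[n][m] == inf:
        raise ValueError("target is too long for this source and max_chunk")
    # Walk the back-pointers from the end to recover the chunks.
    pairs = []
    j = m
    for i in range(n, 0, -1):
        k = back[i][j]
        pairs.append((source[i - 1], target[j - k:j] or NULL))
        j -= k
    pairs.reverse()
    return pairs
```

Because every source position receives exactly one label, the output has the shape that per-symbol classifiers and sequence labelers expect, which is the practical benefit the abstract attributes to removing source-side nulls. A learned model would replace `chunk_cost` with estimated probabilities.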

