Intra-Sentence Segmentation using Maximum Entropy Model for Efficient Parsing of English Sentences

  • Sung-Dong Kim (Division of Computer Engineering, Hansung University)
  • Published: 2005.05.01

Abstract

Long sentence analysis has been a critical problem in machine translation because of its high complexity. Intra-sentence segmentation methods have been proposed to reduce parsing complexity. This paper presents an intra-sentence segmentation method based on a maximum entropy probability model that increases the coverage and accuracy of segmentation. We construct the rules for choosing candidate segmentation positions by a learning method that uses the lexical context of the words tagged as segmentation positions, and we generate a model that assigns a probability value to each candidate segmentation position. The lexical contexts are extracted from a corpus tagged with segmentation positions and are incorporated into the probability model. We build training data from Wall Street Journal sentences and evaluate intra-sentence segmentation on sentences from four different domains. The experiments show about 88% segmentation accuracy and about 98% coverage. The proposed method also improves parsing efficiency by a factor of about 4.8 in speed and about 3.6 in space.
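
The method is described as a maximum entropy probability model over candidate segmentation positions. As a point of reference, the standard conditional maximum entropy formulation used in NLP takes the form sketched below; the notation (binary lexical-context features f_i, weights \lambda_i, context c, decision y) is assumed here for illustration and is not taken from the paper itself.

```latex
% Conditional maximum entropy model for the segmentation decision y
% (segment vs. do-not-segment) at a candidate position with lexical context c.
% Notation is assumed for illustration, not taken from the paper.
\[
  p(y \mid c) = \frac{1}{Z(c)} \exp\Big( \textstyle\sum_{i=1}^{k} \lambda_i f_i(c, y) \Big),
  \qquad
  Z(c) = \sum_{y'} \exp\Big( \textstyle\sum_{i=1}^{k} \lambda_i f_i(c, y') \Big)
\]
% The f_i are binary features over the lexical context of a candidate position,
% the \lambda_i are weights estimated from the segmentation-tagged corpus, and
% y' ranges over the two possible decisions.
```

Under such a formulation, a candidate position would be accepted as an actual segmentation point when p(segment | c) is sufficiently high, which matches the abstract's description of assigning a probability value to each candidate position.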


