Improved Sentence Boundary Detection Method for Web Documents

Lee, Chung-Hee;Jang, Myung-Gil;Seo, Young-Hoon;

Journal of KIISE:Software and Applications (한국정보과학회논문지:소프트웨어및응용)

Volume 37, Issue 6 / pp. 455-463 / 2010 / pISSN 1229-6848

Korean Institute of Information Scientists and Engineers (한국정보과학회)

웹 문서를 위한 개선된 문장경계인식 방법

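Lee, Chung-Hee (Knowledge Mining Team, Electronics and Telecommunications Research Institute);
Jang, Myung-Gil (Knowledge Mining Team, Electronics and Telecommunications Research Institute);
Seo, Young-Hoon (Department of Computer Engineering, Chungbuk National University)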

Received : 2009.11.12
Accepted : 2010.03.19
Published : 2010.06.15


Abstract

In this paper, we present an approach to sentence boundary detection for web documents that builds on a statistical method and applies rule-based correction. The proposed system uses a classification model learned offline from a training set of human-labeled web documents. Web documents contain many word-spacing errors and frequently lack the punctuation marks that indicate the end of a sentence. The proposed method therefore considers every sentence-final ending (eomi), as well as punctuation marks, as a sentence boundary candidate. We optimize engine performance by selecting the best features, the best training data, and the best classification algorithm. For evaluation, we built two test sets: Set1, consisting of news articles and blog documents, and Set2, consisting of web community documents. We use F-measure to compare results. Detecting only periods as sentence boundaries, our baseline engine scored 96.5% on Set1 and 56.7% on Set2. We improved the baseline engine by adapting its features and boundary search algorithm to web documents. In the final evaluation on Set2, the adapted engine scored 96.3%, an improvement of 39.6 percentage points over the baseline engine. These results demonstrate the effectiveness of the proposed method for sentence boundary detection.
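The candidate-generation idea can be illustrated with a minimal Python sketch. It marks a position as a sentence boundary candidate when it follows either a punctuation mark or a sentence-final ending. The ending list `FINAL_ENDINGS` and the function name `boundary_candidates` are hypothetical; the abstract does not give the actual inventory of endings or describe the paper's search algorithm, so this is only an assumption-laden illustration.

```python
import unicodedata

# Hypothetical, deliberately tiny list of Korean sentence-final endings.
# The paper's real inventory is not given in the abstract.
FINAL_ENDINGS = ("습니다", "어요", "아요", "네요", "다")
PUNCTUATION = ".?!"


def boundary_candidates(text):
    """Return sorted character offsets that follow a candidate boundary.

    Offsets refer to the NFC-normalized form of `text`.
    """
    text = unicodedata.normalize("NFC", text)
    offsets = set()

    # Punctuation marks: the only candidates in a punctuation-only baseline.
    for i, ch in enumerate(text):
        if ch in PUNCTUATION:
            offsets.add(i + 1)

    # Sentence-final endings: needed when punctuation is omitted.
    for ending in FINAL_ENDINGS:
        start = text.find(ending)
        while start != -1:
            offsets.add(start + len(ending))
            start = text.find(ending, start + 1)

    return sorted(offsets)


# "좋네요" ends a sentence without any punctuation mark.
print(boundary_candidates("오늘 날씨 좋네요 내일 봐요."))
```

Plain substring matching over-generates on purpose (for example, "다" also occurs inside ordinary words), which is why each candidate would still need to be accepted or rejected by a trained classifier.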

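This paper proposes an improved sentence boundary detection technique, based on statistical language information and post-processing rules, for application to diverse types of web documents. To handle web documents, in which punctuation is frequently omitted and word-spacing errors are common, the proposed method is trained on every sentence-final ending that can serve as a sentence boundary. To maximize detection performance, we selected the optimal features and training data through a range of experiments, and we used rules to correct the errors of the statistical model, which depends on its training data. To measure performance by document type, we evaluated on Test Set 1 (newspaper articles and blog documents), which consists mainly of written-style text in which punctuation marks usually mark sentence boundaries, and on Test Set 2 (posts from website bulletin boards), which consists mainly of web documents with frequent punctuation omission and word-spacing errors. F-measure was used as the evaluation metric. A baseline model trained, as in previous work, with only punctuation marks as boundary candidates achieved 96.5% on Test Set 1 but only 56.7% on Test Set 2. The proposed improvement adapts the features and engine of the baseline model to reflect the characteristics of web documents. Evaluated on Test Set 2, the final model achieved 96.3%, confirming a gain of 39.6 percentage points.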
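As a worked illustration of the evaluation metric, the sketch below computes an F-measure over predicted and gold boundary offsets and reproduces the arithmetic behind the reported gain on the second test set. It assumes the balanced F1 variant (the abstract only says "F-measure"); the function name `boundary_f_measure` and the example offsets are hypothetical.

```python
def boundary_f_measure(predicted, gold):
    """Balanced F-measure (F1) over sets of boundary offsets.

    Assumption: precision and recall are weighted equally; the abstract
    does not state which F-measure variant was used.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two gold boundaries, but the system finds only the
# one marked by a period. Precision is 1.0, recall is 0.5.
score = boundary_f_measure(predicted=[16], gold=[9, 16])
print(f"F-measure: {score:.3f}")

# Reported scores on the second test set, in percent.
baseline, adapted = 56.7, 96.3
print(f"Gain: {adapted - baseline:.1f} percentage points")
```

The toy case mirrors the failure mode the abstract describes: a system that looks only at punctuation keeps high precision but loses recall whenever writers omit sentence-final punctuation.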

