Mining Parallel Text from the Web based on Sentence Alignment

  • Li, Bo (School of Computer Science, Wuhan University)
  • Liu, Juan (School of Computer Science, Wuhan University)
  • Zhu, Huili (School of Computer Science, Wuhan University)
  • Published : 2007.11.01

Abstract

Parallel corpora are an important resource for data-driven natural language processing, but only a few are publicly available, mostly because of the heavy manual labor needed to construct them. This paper proposes a novel strategy for automatically mining parallel text from the web, which may help alleviate the shortage of high-quality parallel corpora. The system we developed first downloads web pages from selected hosts. Candidate parallel page pairs are then generated from the page set based on external features of the web pages. In the final step, each candidate pair is evaluated: the sentences in the two pages are extracted and aligned, and the similarity of the two pages is then computed from the similarities of the aligned sentences. Experiments on a multilingual web site show that the system performs satisfactorily.
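The final evaluation step can be illustrated with a minimal sketch. The abstract does not specify the alignment algorithm or the sentence similarity measure, so everything below is an assumption made for illustration: a monotone one-to-one alignment found by dynamic programming, a toy word-overlap similarity backed by a hypothetical bilingual lexicon, and a page score normalized by the longer page. The names `lexicon_similarity`, `align_sentences`, and `page_similarity` are hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List, Tuple

Sim = Callable[[str, str], float]


def lexicon_similarity(src: str, tgt: str, lexicon: Dict[str, str]) -> float:
    """Toy measure (assumption): share of source words whose lexicon
    translation occurs in the target sentence."""
    src_words = src.lower().split()
    tgt_words = set(tgt.lower().split())
    if not src_words:
        return 0.0
    hits = sum(1 for w in src_words if lexicon.get(w) in tgt_words)
    return hits / len(src_words)


def align_sentences(src: List[str], tgt: List[str], sim: Sim) -> List[Tuple[int, int, float]]:
    """Monotone 1-1 alignment that maximizes total similarity.
    Sentences on either side may be left unaligned."""
    n, m = len(src), len(tgt)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j],                                   # skip a source sentence
                best[i][j - 1],                                   # skip a target sentence
                best[i - 1][j - 1] + sim(src[i - 1], tgt[j - 1]), # align the two
            )
    # Trace back through the table to recover the aligned pairs.
    pairs: List[Tuple[int, int, float]] = []
    i, j = n, m
    while i > 0 and j > 0:
        s = sim(src[i - 1], tgt[j - 1])
        if s > 0 and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1, s))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def page_similarity(src: List[str], tgt: List[str], sim: Sim) -> float:
    """Page-level score (assumption): summed similarity of aligned
    sentences divided by the sentence count of the longer page."""
    if not src or not tgt:
        return 0.0
    pairs = align_sentences(src, tgt, sim)
    return sum(s for _, _, s in pairs) / max(len(src), len(tgt))


# Hypothetical English-French lexicon and two tiny "pages".
lexicon = {"hello": "bonjour", "world": "monde", "black": "noir", "cat": "chat"}
sim = lambda a, b: lexicon_similarity(a, b, lexicon)
english_page = ["hello world", "black cat"]
french_page = ["bonjour monde", "chat noir"]
score = page_similarity(english_page, french_page, sim)
```

Dividing by the longer page's sentence count penalizes pairs in which many sentences stay unaligned; a candidate pair would then be accepted when `score` exceeds some threshold, which would have to be tuned on real data.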

Keywords