Mining Parallel Text from the Web based on Sentence Alignment

Li, Bo;Liu, Juan;Zhu, Huili;

Proceedings of the Korean Society for Language and Information Conference (한국언어정보학회:학술대회논문집)

2007.11a / Pages 285-292 / 2007

Korean Society for Language and Information (한국언어정보학회)

Li, Bo (School of Computer Science, Wuhan University) ;
Liu, Juan (School of Computer Science, Wuhan University) ;
Zhu, Huili (School of Computer Science, Wuhan University)

Published : 2007.11.01

Abstract

The parallel corpus is an important resource in data-driven natural language processing, but only a few parallel corpora are publicly available, mostly because of the heavy manual labor needed to construct this kind of resource. This paper proposes a novel strategy for automatically fetching parallel text from the web, which may help to address the shortage of high-quality parallel corpora. The system we developed first downloads the web pages from certain hosts. Candidate parallel page pairs are then selected from the page set based on the outer features of the web pages. In the last step, the candidate page pairs are evaluated: the sentences in each candidate pair are extracted and aligned, and the similarity of the two web pages is then evaluated based on the similarities of the aligned sentences. Experiments on a multilingual web site show satisfactory performance of the system.
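
The pipeline described in the abstract (candidate pairing from outer features, sentence extraction and alignment, then page-level scoring) can be illustrated with a minimal Python sketch. The abstract does not specify which outer features, alignment algorithm, or sentence similarity measure the authors use, so everything below is an assumption made for illustration: the URL language markers `/en/` and `/zh/`, the positional one-to-one alignment, the length-ratio sentence similarity, and all function names are hypothetical and are not the authors' method.

```python
import re


def candidate_pairs(urls, src_tag="/en/", tgt_tag="/zh/"):
    """Pair URLs that differ only in a language marker.

    Assumption: the "outer feature" is a language tag in the URL path.
    The paper may rely on different features.
    """
    url_set = set(urls)
    pairs = []
    for url in urls:
        if src_tag in url:
            counterpart = url.replace(src_tag, tgt_tag, 1)
            if counterpart in url_set:
                pairs.append((url, counterpart))
    return pairs


def split_sentences(text):
    """Naive sentence splitter on Latin and CJK end punctuation."""
    chunks = re.findall(r"[^.!?。！？]+[.!?。！？]?", text)
    return [c.strip() for c in chunks if c.strip()]


def sentence_similarity(src, tgt):
    """Length-ratio similarity in [0, 1] (a stand-in measure, assumed)."""
    a, b = len(src), len(tgt)
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)


def align_sentences(src_sents, tgt_sents):
    """Positional one-to-one alignment (a placeholder for a real aligner)."""
    return list(zip(src_sents, tgt_sents))


def page_similarity(src_text, tgt_text):
    """Score a page pair from the similarities of its aligned sentences."""
    src = split_sentences(src_text)
    tgt = split_sentences(tgt_text)
    pairs = align_sentences(src, tgt)
    if not pairs:
        return 0.0
    total = sum(sentence_similarity(s, t) for s, t in pairs)
    # Dividing by the longer sentence count means unaligned
    # sentences lower the page score.
    return total / max(len(src), len(tgt))
```

Under these assumptions, a candidate pair would be accepted as parallel when `page_similarity` exceeds some threshold chosen on held-out data. A real system would likely replace the positional alignment with a dynamic-programming aligner and the length ratio with a lexicon-based or statistical similarity.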
