• 제목/요약/키워드: Parallel corpora

검색결과 22건 처리시간 0.018초

Bilingual lexicon induction through a pivot language

  • Kim, Jae-Hoon;Seo, Hyeong-Won;Kwon, Hong-Seok
    • Journal of Advanced Marine Engineering and Technology
    • /
    • 제37권3호
    • /
    • pp.300-306
    • /
    • 2013
  • This paper presents a new method for constructing bilingual lexicons through a pivot language. The proposed method is adapted from the context-based approach, called the standard approach, which is well-known for building bilingual lexicons using comparable corpora. The main difference between the standard approach and the proposed method is how to represent context vectors. The former is to represent context vectors in a target language, while the latter in a pivot language. The proposed method is very simplified from the standard approach thereby. Furthermore, the proposed method is more accurate than the standard approach because it uses parallel corpora instead of comparable corpora. The experiments are conducted on a language pair, Korean and Spanish. Our experimental results have shown that the proposed method is quite attractive where a parallel corpus directly between source and target languages are unavailable, but both source-pivot and pivot-target parallel corpora are available.

Mining Parallel Text from the Web based on Sentence Alignment

  • Li, Bo;Liu, Juan;Zhu, Huili
    • 한국언어정보학회:학술대회논문집
    • /
    • 한국언어정보학회 2007년도 정기학술대회
    • /
    • pp.285-292
    • /
    • 2007
  • The parallel corpus is an important resource in the research field of data-driven natural language processing, but there are only a few parallel corpora publicly available nowadays, mostly due to the high labor force needed to construct this kind of resource. A novel strategy is brought out to automatically fetch parallel text from the web in this paper, which may help to solve the problem of the lack of parallel corpora with high quality. The system we develop first downloads the web pages from certain hosts. Then candidate parallel page pairs are prepared from the page set based on the outer features of the web pages. The candidate page pairs are evaluated in the last step in which the sentences in the candidate web page pairs are extracted and aligned first, and then the similarity of the two web pages is evaluate based on the similarities of the aligned sentences. The experiments towards a multilingual web site show the satisfactory performance of the system.

  • PDF

한영 병렬 코퍼스 구축을 위한 하이브리드 기반 문장 자동 정렬 방법 (A Hybrid Sentence Alignment Method for Building a Korean-English Parallel Corpus)

  • 박정열;차정원
    • 대한음성학회지:말소리
    • /
    • 제68권
    • /
    • pp.95-114
    • /
    • 2008
  • The recent growing popularity of statistical methods in machine translation requires much more large parallel corpora. A Korean-English parallel corpus, however, is not yet enoughly available, little research on this subject is being conducted. In this paper we present a hybrid method of aligning sentences for Korean-English parallel corpora. We use bilingual news wire web pages, reading comprehension materials for English learners, computer-related technical documents and help files of localized software for building a Korean-English parallel corpus. Our hybrid method combines sentence-length based and word-correspondence based methods. We show the results of experimentation and evaluate them. Alignment results from using a full translation model are very encouraging, especially when we apply alignment results to an SMT system: 0.66% for BLEU score and 9.94% for NIST score improvement compared to the previous method.

  • PDF

Analyzing Errors in Bilingual Multi-word Lexicons Automatically Constructed through a Pivot Language

  • Seo, Hyeong-Won;Kim, Jae-Hoon
    • Journal of Advanced Marine Engineering and Technology
    • /
    • 제39권2호
    • /
    • pp.172-178
    • /
    • 2015
  • Constructing a bilingual multi-word lexicon is confronted with many difficulties such as an absence of a commonly accepted gold-standard dataset. Besides, in fact, there is no everybody's definition of what a multi-word unit is. In considering these problems, this paper evaluates and analyzes the context vector approach which is one of a novel alignment method of constructing bilingual lexicons from parallel corpora, by comparing with one of general methods. The approach builds context vectors for both source and target single-word units from two parallel corpora. To adapt the approach to multi-word units, we identify all multi-word candidates (namely noun phrases in this work) first, and then concatenate them into single-word units. As a result, therefore, we can use the context vector approach to satisfy our need for multi-word units. In our experimental results, the context vector approach has shown stronger performance over the other approach. The contribution of the paper is analyzing the various types of errors for the experimental results. For the future works, we will study the similarity measure that not only covers a multi-word unit itself but also covers its constituents.

The Use of MSVM and HMM for Sentence Alignment

  • Fattah, Mohamed Abdel
    • Journal of Information Processing Systems
    • /
    • 제8권2호
    • /
    • pp.301-314
    • /
    • 2012
  • In this paper, two new approaches to align English-Arabic sentences in bilingual parallel corpora based on the Multi-Class Support Vector Machine (MSVM) and the Hidden Markov Model (HMM) classifiers are presented. A feature vector is extracted from the text pair that is under consideration. This vector contains text features such as length, punctuation score, and cognate score values. A set of manually prepared training data was assigned to train the Multi-Class Support Vector Machine and Hidden Markov Model. Another set of data was used for testing. The results of the MSVM and HMM outperform the results of the length based approach. Moreover these new approaches are valid for any language pairs and are quite flexible since the feature vector may contain less, more, or different features, such as a lexical matching feature and Hanzi characters in Japanese-Chinese texts, than the ones used in the current research.

정렬기법을 이용한 미등록 대역어의 자동 추출 (Automatically Extracting Unknown Translations Using Phrase Alignment)

  • 김재훈;양성일
    • 정보처리학회논문지B
    • /
    • 제14B권3호
    • /
    • pp.231-240
    • /
    • 2007
  • 이 논문은 정렬 기법을 이용한 미등록 대역어 추출 모델을 제안하고 그 추출 시스템을 구현한다. 제안된 미등록 대역어 추출 모델은 일종의 구절정렬 모델로서 경계모델과 언어모델 그리고 번역 모델로 구성된다. 제안된 추출 시스템은 병렬말뭉치 구축, 단어정렬, 미등록어 추출로 구성된다. 이 논문에서는 제안된 시스템을 평가하기 위해서 약 1,500여 개의 미등록어가 포함된 2,200문장의 평가말뭉치를 구축하여 다양한 실험을 수행하였다. 실험을 통해서 제안된 모델이 미등록 대역어 추출에 매우 유용함을 알 수 있었다. 앞으로 좀 더 객관적인 평가를 위해 대량의 평가말뭉치 구축이 선행되어야 하며 좀 더 양질의 병렬말뭉치의 구축이 필요할 것이다. 또한 미등록어 추출 모델을 개선하기 다양한 연구가 추진되어야 할 것이다.

Automatic Acquisition of Paraphrases Using Bilingual Dependency Relations

  • Hwang, Young-Sook;Kim, Young-Kil
    • ETRI Journal
    • /
    • 제30권1호
    • /
    • pp.155-157
    • /
    • 2008
  • This letter introduces a new method to automatically acquire paraphrases using bilingual corpora. It utilizes the bilingual dependency relations obtained by projecting a monolingual dependency parse onto the other language's sentence based on statistical alignment techniques. Since the proposed paraphrasing method can clearly disambiguate the sense of the original phrases using the bilingual context of dependency relations, it would be possible to obtain interchangeable paraphrases under a given context. Through experiments with parallel corpora of Korean and English language pairs, we demonstrate that our method effectively extracts paraphrases with high precision, achieving success rates of 94.3% and 84.6%, respectively, for Korean and English.

  • PDF

말뭉치 자원 희소성에 따른 통계적 수지 신호 번역 문제의 해결 (Addressing Low-Resource Problems in Statistical Machine Translation of Manual Signals in Sign Language)

  • 박한철;김정호;박종철
    • 정보과학회 논문지
    • /
    • 제44권2호
    • /
    • pp.163-170
    • /
    • 2017
  • 통계적 기계 번역을 이용한 구어-수화 번역 연구가 활발해짐에도 불구하고 수화 말뭉치의 자원 희소성 문제는 해결되지 않고 있다. 본 연구는 수화 번역의 첫 번째 단계로써 통계적 기계 번역을 이용한 구어-수지 신호 번역에서 말뭉치 자원 희소성으로부터 기인하는 문제점들을 해결할 수 있는 세 가지 전처리 방법을 제안한다. 본 연구에서 제안하는 방법은 1) 구어 문장의 패러프레이징을 통한 말뭉치 확장 방법, 2) 구어 단어의 표제어화를 통한 개별 어휘 출현 빈도 증가 및 구어 표현의 번역 가능성을 향상시키는 방법, 그리고 3) 수지 표현으로 전사되지 않는 구어의 기능어 제거를 통한 구어-수지 표현 간 문장 성분을 일치시키는 방법이다. 서로 다른 특징을 지닌 영어-미국 수화 병렬 말뭉치들을 이용한 실험에서 각 방법론들이 단독으로 쓰일 때와 조합되어 함께 사용되었을 때 모두 말뭉치의 종류와 관계없이 번역 성능을 개선시킬 수 있다는 것을 확인할 수 있었다.