Search | Korea Science

Bilingual lexicon induction through a pivot language

Kim, Jae-Hoon;Seo, Hyeong-Won;Kwon, Hong-Seok
- Journal of Advanced Marine Engineering and Technology / v.37 no.3 / pp.300-306 / 2013
This paper presents a new method for constructing bilingual lexicons through a pivot language. The proposed method is adapted from the context-based approach, known as the standard approach, which is widely used for building bilingual lexicons from comparable corpora. The main difference between the two lies in how context vectors are represented: the standard approach represents them in the target language, whereas the proposed method represents them in a pivot language. This makes the proposed method considerably simpler than the standard approach. Furthermore, the proposed method is more accurate than the standard approach because it uses parallel corpora instead of comparable corpora. The experiments are conducted on one language pair, Korean and Spanish. The experimental results show that the proposed method is attractive when a parallel corpus directly between the source and target languages is unavailable but both source-pivot and pivot-target parallel corpora are available.
https://doi.org/10.5916/jkosme.2013.37.3.300
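
The pivot-language idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how context vectors are weighted or compared, so raw co-occurrence counts over aligned sentence pairs and cosine similarity are assumptions, and the function names and toy corpora (with English standing in as the pivot) are hypothetical.

```python
from collections import Counter, defaultdict
import math


def pivot_vectors(parallel_pairs):
    """Map each word to counts of pivot words seen in its aligned sentences."""
    vectors = defaultdict(Counter)
    for sentence, pivot_sentence in parallel_pairs:
        pivot_tokens = pivot_sentence.split()
        # Count each word once per sentence pair (an assumption).
        for word in set(sentence.split()):
            vectors[word].update(pivot_tokens)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_translation(word, src_vecs, tgt_vecs):
    """Pick the target word whose pivot-language vector is most similar."""
    return max(tgt_vecs, key=lambda t: cosine(src_vecs[word], tgt_vecs[t]))


# Toy source-pivot (Korean-English) and target-pivot (Spanish-English) corpora.
ko_en = [("고양이 잔다", "cat sleeps"), ("개 잔다", "dog sleeps")]
es_en = [("gato duerme", "cat sleeps"), ("perro duerme", "dog sleeps")]

src = pivot_vectors(ko_en)
tgt = pivot_vectors(es_en)
candidate = best_translation("고양이", src, tgt)
```

Because both sets of vectors live in the same pivot vocabulary, they can be compared directly, with no seed dictionary needed to translate context words between the source and target languages.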

Mining Parallel Text from the Web based on Sentence Alignment

Li, Bo;Liu, Juan;Zhu, Huili
- Proceedings of the Korean Society for Language and Information Conference / 2007.11a / pp.285-292 / 2007
Parallel corpora are an important resource for data-driven natural language processing, but only a few are publicly available, mostly because of the heavy manual labor needed to construct them. This paper proposes a novel strategy for automatically fetching parallel text from the web, which may help to address the lack of high-quality parallel corpora. The system we developed first downloads the web pages from certain hosts. Candidate parallel page pairs are then selected from the page set based on the outer features of the web pages. In the last step, the candidate page pairs are evaluated: the sentences in each candidate pair are extracted and aligned, and the similarity of the two web pages is then computed from the similarities of the aligned sentences. Experiments on a multilingual web site show that the system performs satisfactorily.

A Hybrid Sentence Alignment Method for Building a Korean-English Parallel Corpus (한영 병렬 코퍼스 구축을 위한 하이브리드 기반 문장 자동 정렬 방법)

Park, Jung-Yeul;Cha, Jeong-Won
- MALSORI / v.68 / pp.95-114 / 2008
The growing popularity of statistical methods in machine translation calls for much larger parallel corpora. Korean-English parallel corpora, however, are not yet sufficiently available, and little research on this subject is being conducted. In this paper we present a hybrid method of aligning sentences for Korean-English parallel corpora. To build a Korean-English parallel corpus, we use bilingual news wire web pages, reading comprehension materials for English learners, computer-related technical documents, and help files of localized software. Our hybrid method combines sentence-length-based and word-correspondence-based methods. We report and evaluate the experimental results. Alignment results obtained with a full translation model are very encouraging, especially when applied to an SMT system: improvements of 0.66% in BLEU score and 9.94% in NIST score over the previous method.
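
As a rough illustration of combining a sentence-length signal with a word-correspondence signal, the following Python sketch scores a candidate sentence pair. The abstract does not describe how the two signals are combined, so the character-length ratio, the dictionary-lookup score, the weighted sum, and the tiny lexicon are all hypothetical choices made for this example.

```python
def length_score(src, tgt):
    """Ratio of the shorter to the longer sentence length in characters."""
    a, b = len(src), len(tgt)
    return min(a, b) / max(a, b) if max(a, b) else 0.0


def lexical_score(src, tgt, lexicon):
    """Fraction of source tokens with a listed translation in the target."""
    src_tokens = src.split()
    tgt_tokens = set(tgt.lower().split())
    if not src_tokens:
        return 0.0
    hits = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt_tokens)
    return hits / len(src_tokens)


def hybrid_score(src, tgt, lexicon, weight=0.5):
    """Weighted sum of the two signals; the weight is an assumed parameter."""
    return (weight * length_score(src, tgt)
            + (1 - weight) * lexical_score(src, tgt, lexicon))


# Hypothetical Korean-English lexicon for illustration only.
lexicon = {"나는": {"i"}, "학교에": {"school"}, "간다": {"go"}}

good = hybrid_score("나는 학교에 간다", "i go to school", lexicon)
bad = hybrid_score("나는 학교에 간다", "it rains", lexicon)
```

In this toy case the wrong pair happens to have a closer length ratio, and only the lexical signal separates the two candidates, which is the usual motivation for combining both kinds of evidence.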

Analyzing Errors in Bilingual Multi-word Lexicons Automatically Constructed through a Pivot Language

Seo, Hyeong-Won;Kim, Jae-Hoon
- Journal of Advanced Marine Engineering and Technology / v.39 no.2 / pp.172-178 / 2015
Constructing a bilingual multi-word lexicon faces many difficulties, such as the absence of a commonly accepted gold-standard dataset. Moreover, there is no universally agreed definition of what a multi-word unit is. With these problems in mind, this paper evaluates and analyzes the context vector approach, a novel alignment method for constructing bilingual lexicons from parallel corpora, by comparing it with one of the general methods. The approach builds context vectors for both source and target single-word units from two parallel corpora. To adapt the approach to multi-word units, we first identify all multi-word candidates (namely noun phrases in this work) and then concatenate each of them into a single-word unit. The context vector approach can then be applied to multi-word units. In our experiments, the context vector approach performed better than the other approach. The contribution of the paper is an analysis of the various types of errors in the experimental results. In future work, we will study a similarity measure that covers not only a multi-word unit itself but also its constituents.
https://doi.org/10.5916/jkosme.2015.39.2.172

The Use of MSVM and HMM for Sentence Alignment

Fattah, Mohamed Abdel
- Journal of Information Processing Systems / v.8 no.2 / pp.301-314 / 2012
This paper presents two new approaches to aligning English-Arabic sentences in bilingual parallel corpora, based on the Multi-Class Support Vector Machine (MSVM) and Hidden Markov Model (HMM) classifiers. A feature vector is extracted from the text pair under consideration. This vector contains text features such as length, punctuation score, and cognate score values. A set of manually prepared training data was used to train the Multi-Class Support Vector Machine and the Hidden Markov Model. Another set of data was used for testing. Both the MSVM and the HMM outperform the length-based approach. Moreover, these new approaches are valid for any language pair and are quite flexible, since the feature vector may contain fewer, more, or different features than the ones used in the current research, such as a lexical matching feature or Hanzi characters in Japanese-Chinese texts.
https://doi.org/10.3745/JIPS.2012.8.2.301
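
The kind of feature vector mentioned in this abstract might look like the Python sketch below. The abstract names the features (length, punctuation score, cognate score) but not their definitions, so the formulas here are assumptions: a character-length ratio, an overlap of ASCII punctuation marks, and shared digit sequences as a stand-in for cognates. The English-Spanish pair is used only to keep the example readable and is not from the paper.

```python
import re
from collections import Counter

PUNCT = set(".,;:!?()\"'")  # assumed punctuation inventory


def length_ratio(src, tgt):
    """Shorter length divided by longer length, in characters."""
    a, b = len(src), len(tgt)
    return min(a, b) / max(a, b) if max(a, b) else 0.0


def punctuation_score(src, tgt):
    """Share of punctuation marks that the two sentences have in common."""
    p = Counter(ch for ch in src if ch in PUNCT)
    q = Counter(ch for ch in tgt if ch in PUNCT)
    total = max(sum(p.values()), sum(q.values()))
    return sum((p & q).values()) / total if total else 1.0


def cognate_score(src, tgt):
    """Jaccard overlap of digit sequences, a crude proxy for cognates."""
    nums_s = set(re.findall(r"\d+", src))
    nums_t = set(re.findall(r"\d+", tgt))
    union = nums_s | nums_t
    return len(nums_s & nums_t) / len(union) if union else 0.0


def feature_vector(src, tgt):
    """Three-element feature vector for one candidate sentence pair."""
    return [length_ratio(src, tgt),
            punctuation_score(src, tgt),
            cognate_score(src, tgt)]


vec = feature_vector("The meeting is on 12 May, 2012.",
                     "La reunión es el 12 de mayo, 2012.")
```

A vector like this would then be passed to a classifier that decides whether the pair is aligned; adding or removing a feature only changes the length of the list, which matches the flexibility the abstract describes.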

Automatic Acquisition of a High-Precision Translation Lexicon from Parallel Chinese English Corpora

Gao, Zhao-Ming
- Proceedings of the Korean Society for Language and Information Conference / 1998.02a / pp.248-254 / 1998

Sentence and Paragraph Alignment in Japanese to Korean Parallel Corpora (일한대역문의 문장과 단락의 정렬)

;長尾眞
- Korean Journal of Cognitive Science / v.7 no.4 / pp.179-202 / 1996

Automatically Extracting Unknown Translations Using Phrase Alignment (정렬기법을 이용한 미등록 대역어의 자동 추출)

Kim, Jae-Hoon;Yang, Sung-Il
- The KIPS Transactions:PartB / v.14B no.3 s.113 / pp.231-240 / 2007
In this paper, we propose an automatic extraction model for unknown translations and implement an unknown translation extraction system based on it. The proposed model is a phrase-alignment model that combines three models: a phrase-boundary model, a language model, and a translation model. The system built on this model consists of three parts: construction of parallel corpora, alignment of Korean and English words, and extraction of unknown translations. To evaluate the performance of the system, we built a reference corpus for unknown translation extraction, which comprises 2,220 parallel sentences including about 1,500 unknown translations. Through several experiments, we observed that the proposed model is very useful for extracting unknown translations. Future work should address objective evaluation, the construction of high-quality parallel corpora, and further improvement of unknown translation extraction performance.
https://doi.org/10.3745/KIPSTB.2007.14-B.3.231

Automatic Acquisition of Paraphrases Using Bilingual Dependency Relations

Hwang, Young-Sook;Kim, Young-Kil
- ETRI Journal / v.30 no.1 / pp.155-157 / 2008
This letter introduces a new method to automatically acquire paraphrases using bilingual corpora. It utilizes the bilingual dependency relations obtained by projecting a monolingual dependency parse onto the other language's sentence based on statistical alignment techniques. Since the proposed paraphrasing method can clearly disambiguate the sense of the original phrases using the bilingual context of dependency relations, it would be possible to obtain interchangeable paraphrases under a given context. Through experiments with parallel corpora of Korean and English language pairs, we demonstrate that our method effectively extracts paraphrases with high precision, achieving success rates of 94.3% and 84.6%, respectively, for Korean and English.

Addressing Low-Resource Problems in Statistical Machine Translation of Manual Signals in Sign Language (말뭉치 자원 희소성에 따른 통계적 수지 신호 번역 문제의 해결)

Park, Hancheol;Kim, Jung-Ho;Park, Jong C.
- Journal of KIISE / v.44 no.2 / pp.163-170 / 2017
Despite the rise of studies on spoken-to-sign language translation, the low-resource problems of sign language corpora have rarely been addressed. As a first step towards translating from spoken to sign language, we addressed the problems arising from resource scarcity when translating spoken language into manual signals using statistical machine translation techniques. More specifically, we proposed three preprocessing methods: 1) paraphrase generation, which increases the size of the corpora, 2) lemmatization, which increases the frequency of each word in the corpora and the translatability of new input words in spoken language, and 3) elimination of function words that are not glossed into manual signals, which matches the corresponding constituents of the bilingual sentence pairs. In our experiments, we used different types of English-American Sign Language parallel corpora. The experimental results showed that the system with each method, and with the combination of the methods, improved the quality of manual signal translation regardless of the type of corpus.
https://doi.org/10.5626/JOK.2017.44.2.163

Search Result 22, Processing Time 0.025 seconds
