• Title/Summary/Keyword: English-Korean alignment

A Hybrid Sentence Alignment Method for Building a Korean-English Parallel Corpus (한영 병렬 코퍼스 구축을 위한 하이브리드 기반 문장 자동 정렬 방법)

  • Park, Jung-Yeul;Cha, Jeong-Won
    • MALSORI / v.68 / pp.95-114 / 2008
  • The recent growing popularity of statistical methods in machine translation requires much larger parallel corpora. Korean-English parallel corpora, however, are not yet sufficiently available, and little research on this subject is being conducted. In this paper we present a hybrid method of aligning sentences for Korean-English parallel corpora. To build a Korean-English parallel corpus, we use bilingual newswire web pages, reading comprehension materials for English learners, computer-related technical documents, and help files of localized software. Our hybrid method combines sentence-length-based and word-correspondence-based methods. We present and evaluate the experimental results. Alignment results obtained with a full translation model are very encouraging, especially when applied to an SMT system: the BLEU score improves by 0.66% and the NIST score by 9.94% compared to the previous method.
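
The idea of combining a sentence-length signal with a word-correspondence signal can be illustrated with a minimal sketch. This is not the authors' method: the scoring functions, the `weight` parameter, and the toy lexicon are all hypothetical, and the length score is a plain character-length ratio rather than a statistical length model.

```python
def length_score(src, tgt):
    """Ratio of the shorter to the longer sentence length, in characters (0..1)."""
    ls, lt = len(src), len(tgt)
    if ls == 0 or lt == 0:
        return 0.0
    return min(ls, lt) / max(ls, lt)


def lexical_score(src, tgt, lexicon):
    """Fraction of source words with a lexicon translation found in the target."""
    src_words = src.lower().split()
    if not src_words:
        return 0.0
    hits = sum(
        1 for w in src_words
        if any(t in tgt for t in lexicon.get(w, ()))
    )
    return hits / len(src_words)


def hybrid_score(src, tgt, lexicon, weight=0.5):
    """Weighted mix of both signals; `weight` is an illustrative assumption."""
    return (weight * length_score(src, tgt)
            + (1 - weight) * lexical_score(src, tgt, lexicon))


def best_match(src, candidates, lexicon, weight=0.5):
    """Index of the candidate target sentence with the highest hybrid score."""
    return max(
        range(len(candidates)),
        key=lambda i: hybrid_score(src, candidates[i], lexicon, weight),
    )


# Toy data, invented for illustration only.
toy_lexicon = {"cat": ["고양이"], "sleeps": ["잔다"]}
candidates = ["개가 짖는다", "고양이가 잔다"]
chosen = best_match("The cat sleeps", candidates, toy_lexicon)
```

Raw character counts differ systematically between English and Korean, so a practical length model would normalize for the expected length ratio of the language pair.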

An Automatic Extraction of English-Korean Bilingual Terms by Using Word-level Presumptive Alignment (단어 단위의 추정 정렬을 통한 영-한 대역어의 자동 추출)

  • Lee, Kong Joo
    • KIPS Transactions on Software and Data Engineering / v.2 no.6 / pp.433-442 / 2013
  • A set of bilingual terms is one of the most important resources for building language-related applications such as machine translation systems and cross-lingual information systems. In this paper, we introduce a new approach that automatically extracts English-Korean bilingual term candidates by using a bilingual parallel corpus and a basic English-Korean lexicon. This approach is useful even when the parallel corpus is small. Sentence alignment is first performed on the document-level parallel corpus. Words in each pair of aligned sentences are then aligned by referencing the basic bilingual lexicon. For words that remain unaligned within a pair of aligned sentences, several assumptions are applied in order to align bilingual term candidates of the two languages. Examples of these assumptions include the location of a sentence, the relations between words, and linguistic information about the two languages. An experimental result shows approximately 71.7% accuracy for the English-Korean bilingual term candidates which are automatically extracted from 1,000 bilingual parallel corpus.
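
The lexicon-lookup step described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the token lists, and the toy lexicon are hypothetical, and the paper's assumptions for pairing the leftover words (sentence location, word relations, linguistic information) are not implemented here.

```python
def align_with_lexicon(en_tokens, ko_tokens, lexicon):
    """Align tokens of one sentence pair by lexicon lookup.

    Returns (pairs, unaligned_en, unaligned_ko), where pairs holds
    (english_index, korean_index) tuples and the two lists hold the
    indices of tokens the lexicon could not align.
    """
    pairs = []
    used_ko = set()
    for i, en in enumerate(en_tokens):
        translations = lexicon.get(en.lower(), ())
        for j, ko in enumerate(ko_tokens):
            # Take the first unused Korean token that the lexicon lists.
            if j not in used_ko and ko in translations:
                pairs.append((i, j))
                used_ko.add(j)
                break
    aligned_en = {i for i, _ in pairs}
    unaligned_en = [i for i in range(len(en_tokens)) if i not in aligned_en]
    unaligned_ko = [j for j in range(len(ko_tokens)) if j not in used_ko]
    return pairs, unaligned_en, unaligned_ko


# Toy sentence pair and lexicon, invented for illustration only.
en = ["parallel", "corpus", "alignment"]
ko = ["병렬", "말뭉치", "정렬"]
toy_lexicon = {"parallel": {"병렬"}, "alignment": {"정렬"}}
pairs, rest_en, rest_ko = align_with_lexicon(en, ko, toy_lexicon)
```

In this toy case the leftover tokens, "corpus" and "말뭉치", are the ones that a later step would consider as a bilingual term candidate.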

English-Korean speech translation corpus (EnKoST-C): Construction procedure and evaluation results

  • Jeong-Uk Bang;Joon-Gyu Maeng;Jun Park;Seung Yun;Sang-Hun Kim
    • ETRI Journal / v.45 no.1 / pp.18-27 / 2023
  • We present an English-Korean speech translation corpus, named EnKoST-C. End-to-end model training for speech translation tasks often suffers from a lack of parallel data, such as speech data in the source language and equivalent text data in the target language. Most available public speech translation corpora were developed for European languages, and there is currently no public corpus for English-Korean end-to-end speech translation. Thus, we created EnKoST-C, centered on TED Talks. In this process, we enhanced the sentence alignment approach using subtitle time information and bilingual sentence embedding information. As a result, we built a 559-h English-Korean speech translation corpus. The proposed sentence alignment approach achieved an excellent F-measure of 0.96. We also show the baseline performance of an English-Korean speech translation model trained with EnKoST-C. EnKoST-C is freely available on a Korean government open data hub site.

An Alignment Model for Extracting English-Korean Translations of Term Constituents (영-한 조어단위 대역쌍 추출을 위한 조어단위 정렬 모델)

  • Oh Jong-Hoon;Huang Jin-Xia;Choi Key-Sun
    • Journal of KIISE:Software and Applications / v.32 no.4 / pp.300-311 / 2005
  • Terms are linguistic realizations of technical concepts. Term constituents are important elements used for representing a concept. Since many new terms are created by modifying or combining existing constituents, analyzing term constituents is important for understanding the concept of a term; that is, term constituents offer clues to the concept. However, matching concept units and term constituents is difficult because of mismatches between a term constituent and a concept unit, as well as homonymy and synonymy among term constituents. To solve these problems, it is necessary to recognize the concept units of term constituents. In this paper, we define an English term constituent as the concept unit and use an alignment algorithm between English and Korean term constituents in order to recognize the concept units of term constituents. With our alignment algorithm, we recognize Korean term constituents corresponding to an English term constituent with about 93% precision.

Comparison of Phone Boundary Alignment between Handlabels and Autolabels

  • Jang, Tae-Yeoub;Chung, Hyun-Song
    • Speech Sciences / v.10 no.1 / pp.27-39 / 2003
  • This study attempts to verify the reliability of automatically generated segment labels as compared to those obtained by conventional labelling by hand. First, an autolabeller is constructed using the standard HMM speech recognition technique. For evaluation, we compare the automatically generated labels with manually annotated labels for the same speech data. The comparison is performed by calculating the temporal difference between an autolabel boundary and its corresponding hand label boundary. When the mismatched duration between two labels falls within 10 msec, we consider the autolabel correct. The results suggest that overall 78% of autolabels are correctly obtained. The boundaries of obstruents are found to be better aligned than those of sonorants and vowels. Among stop sound classes, strong stops (in terms of manner of articulation) and velar stops (in terms of place of articulation) show better performance in boundary alignment. The results suggest that more phone-specific consideration is necessary to improve autosegmentation performance.
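
The evaluation criterion in this abstract (an autolabel boundary counts as correct when it falls within 10 msec of its hand-labelled counterpart) can be sketched as below. The function name and the sample boundary times are hypothetical, and the sketch assumes the two boundary lists are already paired one-to-one.

```python
def boundary_accuracy(auto_ms, hand_ms, tolerance_ms=10.0):
    """Fraction of autolabel boundaries within the tolerance of hand labels.

    Both arguments are boundary times in milliseconds, paired by position.
    """
    if len(auto_ms) != len(hand_ms):
        raise ValueError("boundary lists must be paired one-to-one")
    if not auto_ms:
        return 0.0
    correct = sum(
        1 for a, h in zip(auto_ms, hand_ms)
        if abs(a - h) <= tolerance_ms
    )
    return correct / len(auto_ms)


# Invented boundary times; the differences are 4, 15, 8 and 0 msec.
auto = [100, 215, 330, 452]
hand = [104, 200, 338, 452]
accuracy = boundary_accuracy(auto, hand)  # 3 of 4 boundaries are within 10 msec
```

Computing the same ratio separately per phone class would give the kind of class-wise comparison the abstract reports for obstruents, sonorants, and vowels.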

The Effects of Misalignment between Syllable and Word Onsets on Word Recognition in English (음절의 시작과 단어 시작의 불일치가 영어 단어 인지에 미치는 영향)

  • Kim, Sun-Mi;Nam, Ki-Chun
    • Phonetics and Speech Sciences / v.1 no.4 / pp.61-71 / 2009
  • This study aims to investigate whether the misalignment between syllable and word onsets due to the process of resyllabification affects how Korean-English late bilinguals perceive continuous English speech. Two word-spotting experiments were conducted. In Experiment 1, misalignment conditions (resyllabified conditions) were created by adding CVC contexts at the beginning of vowel-initial words, and alignment conditions (non-resyllabified conditions) were created by putting the same CVC contexts at the beginning of consonant-initial words. The results of Experiment 1 showed that targets were detected faster and more accurately in alignment conditions than in misalignment conditions. Experiment 2 was conducted to rule out the possibility that the results of Experiment 1 were due to consonant-initial words being easier to recognize than vowel-initial words. For this reason, all the experimental stimuli in Experiment 2 were vowel-initial words preceded by CVC contexts or CV contexts. Experiment 2 also showed a misalignment cost when recognizing words in resyllabified conditions. These results indicate that Korean listeners are influenced by misalignment between syllable and word onsets triggered by a resyllabification process when recognizing words in English connected speech.

A Study on Automatic Measurement of Pronunciation Accuracy of English Speech Produced by Korean Learners of English (한국인 영어 학습자의 발음 정확성 자동 측정방법에 대한 연구)

  • Yun, Weon-Hee;Chung, Hyun-Sung;Jang, Tae-Yeoub
    • Proceedings of the KSPS conference / 2005.11a / pp.17-20 / 2005
  • The purpose of this project is to develop a device that can automatically measure the pronunciation of English speech produced by Korean learners of English. Pronunciation proficiency will be measured in two main areas: suprasegmental and segmental. In the suprasegmental area, intonation and word stress will be traced and compared with those of native speakers by way of statistical methods using tilt parameters. Phone durations are also examined to measure the naturalness of speakers' pronunciation. In doing so, statistical duration modelling from a large speech database using CART will be considered. For segmental measurement of pronunciation, the acoustic probability of a phone, which is a byproduct of forced alignment, will be the basis for scoring the pronunciation accuracy of that phone. The final score will be given as feedback to learners to help them improve their pronunciation.

Word class information in perception of prosodic prominence by Korean learners of English

  • Im, Suyeon
    • Phonetics and Speech Sciences / v.11 no.4 / pp.1-8 / 2019
  • This study aims to investigate how prosodic prominence is perceived in relation to word class information (or parts-of-speech) by Korean learners of English compared with native English speakers in public speech. Two groups, Korean learners of English and native English speakers, were asked to judge words perceived as prominent simultaneously while listening to a speech. Parts-of-speech and three acoustic cues (i.e., max F0, mean phone duration, and mean intensity) were analyzed for each word in the speech. The results showed that content words tended to be higher in pitch and longer in duration than function words. Both groups of listeners rated prominence on content words more frequently than on function words. This tendency, however, was significantly greater for Korean learners of English than for native English speakers. Among the parts-of-speech of the content words, Korean learners of English were more likely than native English speakers to judge nouns and verbs as prominent. This study presents evidence that Korean learners of English consider most, if not all, content words as landing locations of prosodic prominence, in alignment with the previous study on the production of prominence.

Alignment of Hypernym-Hyponym Noun Pairs between Korean and English, Based on the EuroWordNet Approach (유로워드넷 방식에 기반한 한국어와 영어의 명사 상하위어 정렬)

  • Kim, Dong-Sung
    • Language and Information / v.12 no.1 / pp.27-65 / 2008
  • This paper presents a set of methodologies for aligning hypernym-hyponym noun pairs between Korean and English, based on the EuroWordNet approach. Following the methods used in EuroWordNet, our approach makes extensive use of WordNet across the four steps of the building process: 1) monolingual dictionaries are used to extract proper hypernym-hyponym noun pairs, 2) a bilingual dictionary is used to convert the extracted pairs, 3) WordNet is used as the backbone of the alignment criteria, and 4) WordNet is used to select the most similar pair among the candidates. The importance of this study lies not only in enriching semantic links between the two languages, but also in integrating lexical resources based on a language-specific and language-dependent structure. Our approach is aimed at building an accurate and detailed lexical resource with proper measures rather than at the fast development of a generic one using NLP techniques.

Automatic Acquisition of Paraphrases Using Bilingual Dependency Relations

  • Hwang, Young-Sook;Kim, Young-Kil
    • ETRI Journal / v.30 no.1 / pp.155-157 / 2008
  • This letter introduces a new method to automatically acquire paraphrases using bilingual corpora. It utilizes the bilingual dependency relations obtained by projecting a monolingual dependency parse onto the other language's sentence based on statistical alignment techniques. Since the proposed paraphrasing method can clearly disambiguate the sense of the original phrases using the bilingual context of dependency relations, it would be possible to obtain interchangeable paraphrases under a given context. Through experiments with parallel corpora of Korean and English language pairs, we demonstrate that our method effectively extracts paraphrases with high precision, achieving success rates of 94.3% and 84.6%, respectively, for Korean and English.
