• 제목/요약/키워드: English-Korean alignment

검색결과 38건 처리시간 0.02초

한영 병렬 코퍼스 구축을 위한 하이브리드 기반 문장 자동 정렬 방법 (A Hybrid Sentence Alignment Method for Building a Korean-English Parallel Corpus)

  • 박정열;차정원
    • 대한음성학회지:말소리
    • /
    • 제68권
    • /
    • pp.95-114
    • /
    • 2008
  • The recent growing popularity of statistical methods in machine translation requires much more large parallel corpora. A Korean-English parallel corpus, however, is not yet enoughly available, little research on this subject is being conducted. In this paper we present a hybrid method of aligning sentences for Korean-English parallel corpora. We use bilingual news wire web pages, reading comprehension materials for English learners, computer-related technical documents and help files of localized software for building a Korean-English parallel corpus. Our hybrid method combines sentence-length based and word-correspondence based methods. We show the results of experimentation and evaluate them. Alignment results from using a full translation model are very encouraging, especially when we apply alignment results to an SMT system: 0.66% for BLEU score and 9.94% for NIST score improvement compared to the previous method.

  • PDF

단어 단위의 추정 정렬을 통한 영-한 대역어의 자동 추출 (An Automatic Extraction of English-Korean Bilingual Terms by Using Word-level Presumptive Alignment)

  • 이공주
    • 정보처리학회논문지:소프트웨어 및 데이터공학
    • /
    • 제2권6호
    • /
    • pp.433-442
    • /
    • 2013
  • 기계번역 시스템 구축에 가장 필수적인 요소는 번역하고자 하는 언어간의 단어쌍을 담고 있는 대역어 사전이다. 대역어 사전은 기계번역뿐만 아니라 서로 다른 언어간의 정보를 교환하는 모든 응용프로그램의 필수적인 지식원(knowledge source)이다. 본 연구에서는 문서 단위로 정렬된 병렬 코퍼스와 기본적인 대역어 사전을 이용하여 영-한 대역어를 자동으로 추출하는 방법에 대해 소개한다. 이 방법은 수집된 병렬 코퍼스의 크기에 영향을 받지 않는 방법이다. 문서 단위로 정렬된 병렬 코퍼스로부터 문장 단위의 정렬을 수행하고 다시 단어 단위의 정렬을 수행한 후, 정렬이 채 되지 않은 부분에 대해 추정 정렬을 수행한다. 추정 정렬에는 문장에서의 위치, 다른 단어와의 관계, 두 언어간의 언어적 정보등 다양한 정보가 사용된다. 이렇게 추정 정렬된 단어쌍으로부터 영-한 대역어를 추출할 수 있다. 약 1,000개로 구성된 병렬 코퍼스로부터 추출한 영-한 대역어는 71.7%의 정확도를 얻을 수 있었다.

English-Korean speech translation corpus (EnKoST-C): Construction procedure and evaluation results

  • Jeong-Uk Bang;Joon-Gyu Maeng;Jun Park;Seung Yun;Sang-Hun Kim
    • ETRI Journal
    • /
    • 제45권1호
    • /
    • pp.18-27
    • /
    • 2023
  • We present an English-Korean speech translation corpus, named EnKoST-C. End-to-end model training for speech translation tasks often suffers from a lack of parallel data, such as speech data in the source language and equivalent text data in the target language. Most available public speech translation corpora were developed for European languages, and there is currently no public corpus for English-Korean end-to-end speech translation. Thus, we created an EnKoST-C centered on TED Talks. In this process, we enhance the sentence alignment approach using the subtitle time information and bilingual sentence embedding information. As a result, we built a 559-h English-Korean speech translation corpus. The proposed sentence alignment approach showed excellent performance of 0.96 f-measure score. We also show the baseline performance of an English-Korean speech translation model trained with EnKoST-C. The EnKoST-C is freely available on a Korean government open data hub site.

영-한 조어단위 대역쌍 추출을 위한 조어단위 정렬 모델 (An Alignment Model for Extracting English-Korean Translations of Term Constituents)

  • 오종훈;황금하;최기선
    • 한국정보과학회논문지:소프트웨어및응용
    • /
    • 제32권4호
    • /
    • pp.300-311
    • /
    • 2005
  • 전문용어는 전문분야의 개념을 표현하는 언어적 표현이다. 전문용어의 조어단위는 전문용어를 구성하는 최소의 형태적 단위이다. 따라서 조어단위는 전문용어의 의미를 파악하는데 중요한 요소이다. 하지만 조어단위를 이용한 전문용어의 의미파악은 ‘조어단위와 개념단위의 불일치 문제’, 조어 단위의 ‘동형이의어’, ‘동의어’문제 둥으로 인한 어려움이 있다. 이러한 문제를 해결하기 위해서는 하나의 개념을 나타내는 조어단위의 덩어리인 개념단위를 파악하는 작업이 선행되어야 한다. 본 논문에서는 영어의 조어단위를 하나의 개념단위로 정의하고 개념단위에 대응되는 한국어 조어단위의 집합을 개념단위로 인식한다. 개념단위의 파악과정은 영한 대역 전문용어사전에 대한 영어-한국어 조어단위 정렬문제로 해결하고자 한다. 본 논문의 기법은 물리, 화학, 생물 분야에 대한 조어정렬 실험을 수행하였으며, 평균 약 $93\%$의 정확률로 조어단위 간의 정렬을 수행하였다

Comparison of Phone Boundary Alignment between Handlabels and Autolabels

  • Jang, Tae-Yeoub;Chung, Hyun-Song
    • 음성과학
    • /
    • 제10권1호
    • /
    • pp.27-39
    • /
    • 2003
  • This study attempts to verify the reliability of automatically generated segment labels as compared to those obtained by conventional labelling by hand. First of all, an autolabeller is constructed using the standard HMM speech recognition technique. For evaluation, we compare the automatically generated labels with manually annotated labels for the same speech data. The comparison is performed by calculating the temporal difference between an autolabel boundary and its corresponding hand label boundary. When the mismatched duration between two labels falls within 10 msec, we consider the autolabel as correct. The results suggest that overall 78% of autolabels are correctly obtained. It is found that the boundary of obstruents is better aligned than that of sonorants and vowels. In case of stop sound classes, strong stops in manner-of-articulation wise and velar stops in place-of-articulation wise show better performance in boundary alignment. The result suggests that more phone-specific consideration is necessary to improve autosegmentation performance.

  • PDF

음절의 시작과 단어 시작의 불일치가 영어 단어 인지에 미치는 영향 (The Effects of Misalignment between Syllable and Word Onsets on Word Recognition in English)

  • 김선미;남기춘
    • 말소리와 음성과학
    • /
    • 제1권4호
    • /
    • pp.61-71
    • /
    • 2009
  • This study aims to investigate whether the misalignment between syllable and word onsets due to the process of resyllabification affects Korean-English late bilinguals perceiving English continuous speech. Two word-spotting experiments were conducted. In Experiment 1, misalignment conditions (resyllabified conditions) were created by adding CVC contexts at the beginning of vowel-initial words and alignment conditions (non-resyllabified conditions) were made by putting the same CVC contexts at the beginning of consonant-initial words. The results of Experiment 1 showed that detections of targets in alignment conditions were faster and more correct than in misalignment conditions. Experiment 2 was conducted in order to avoid any possibilities that the results of Experiment 1 were due to consonant-initial words being easier to recognize than vowel-initial words. For this reason, all the experimental stimuli of Experiment 2 were vowel-initial words preceded by CVC contexts or CV contexts. Experiment 2 also showed misalignment cost when recognizing words in resyllabified conditions. These results indicate that Korean listeners are influenced by misalignment between syllable and word onsets triggered by a resyllabification process when recognizing words in English connected speech.

  • PDF

한국인 영어 학습자의 발음 정확성 자동 측정방법에 대한 연구 (A Study on Automatic Measurement of Pronunciation Accuracy of English Speech Produced by Korean Learners of English)

  • 윤원희;정현성;장태엽
    • 대한음성학회:학술대회논문집
    • /
    • 대한음성학회 2005년도 추계 학술대회 발표논문집
    • /
    • pp.17-20
    • /
    • 2005
  • The purpose of this project is to develop a device that can automatically measure pronunciation of English speech produced by Korean learners of English. Pronunciation proficiency will be measured largely in two areas; suprasegmental and segmental areas. In suprasegmental area, intonation and word stress will be traced and compared with those of native speakers by way of statistical methods using tilt parameters. Durations of phones are also examined to measure speakers' naturalness of their pronunciations. In doing so, statistical duration modelling from a large speech database using CART will be considered. For segmental measurement of pronunciation, acoustic probability of a phone, which is a byproduct when doing the forced alignment, will be a basis of scoring pronunciation accuracy of a phone. The final score will be a feedback to the learners to improve their pronunciation.

  • PDF

Word class information in perception of prosodic prominence by Korean learners of English

  • Im, Suyeon
    • 말소리와 음성과학
    • /
    • 제11권4호
    • /
    • pp.1-8
    • /
    • 2019
  • This study aims to investigate how prosodic prominence is perceived in relation to word class information (or parts-of-speech) by Korean learners of English compared with native English speakers in public speech. Two groups, Korean learners of English and native English speakers, were asked to judge words perceived as prominent simultaneously while listening to a speech. Parts-of-speech and three acoustic cues (i.e., max F0, mean phone duration, and mean intensity) were analyzed for each word in the speech. The results showed that content words tended to be higher in pitch and longer in duration than function words. Both groups of listeners rated prominence on content words more frequently than on function words. This tendency, however, was significantly greater for Korean learners of English than for native English speakers. Among the parts-of-speech of the content words, Korean learners of English were more likely than native English speakers to judge nouns and verbs as prominent. This study presents evidence that Korean learners of English consider most, if not all, content words as landing locations of prosodic prominence, in alignment with the previous study on the production of prominence.

유로워드넷 방식에 기반한 한국어와 영어의 명사 상하위어 정렬 (Alignment of Hypernym-Hyponym Noun Pairs between Korean and English, Based on the EuroWordNet Approach)

  • 김동성
    • 한국언어정보학회지:언어와정보
    • /
    • 제12권1호
    • /
    • pp.27-65
    • /
    • 2008
  • This paper presents a set of methodologies for aligning hypernym-hyponym noun pairs between Korean and English, based on the EuroWordNet approach. Following the methods conducted in EuroWordNet, our approach makes extensive use of WordNet in four steps of the building process: 1) Monolingual dictionaries have been used to extract proper hypernym-hyponym noun pairs, 2) bilingual dictionary has converted the extracted pairs, 3) Word Net has been used as a backbone of alignment criteria, and 4) WordNet has been used to select the most similar pair among the candidates. The importance of this study lies not only on enriching semantic links between two languages, but also on integrating lexical resources based on a language specific and dependent structure. Our approaches are aimed at building an accurate and detailed lexical resource with proper measures rather than at fast development of generic one using NLP technique.

  • PDF

Automatic Acquisition of Paraphrases Using Bilingual Dependency Relations

  • Hwang, Young-Sook;Kim, Young-Kil
    • ETRI Journal
    • /
    • 제30권1호
    • /
    • pp.155-157
    • /
    • 2008
  • This letter introduces a new method to automatically acquire paraphrases using bilingual corpora. It utilizes the bilingual dependency relations obtained by projecting a monolingual dependency parse onto the other language's sentence based on statistical alignment techniques. Since the proposed paraphrasing method can clearly disambiguate the sense of the original phrases using the bilingual context of dependency relations, it would be possible to obtain interchangeable paraphrases under a given context. Through experiments with parallel corpora of Korean and English language pairs, we demonstrate that our method effectively extracts paraphrases with high precision, achieving success rates of 94.3% and 84.6%, respectively, for Korean and English.

  • PDF