• Title/Summary/Keyword: Data Translation


English-Korean speech translation corpus (EnKoST-C): Construction procedure and evaluation results

  • Jeong-Uk Bang;Joon-Gyu Maeng;Jun Park;Seung Yun;Sang-Hun Kim
    • ETRI Journal / v.45 no.1 / pp.18-27 / 2023
  • We present an English-Korean speech translation corpus, named EnKoST-C. End-to-end model training for speech translation tasks often suffers from a lack of parallel data, that is, speech data in the source language paired with equivalent text data in the target language. Most publicly available speech translation corpora were developed for European languages, and there is currently no public corpus for English-Korean end-to-end speech translation. We therefore built EnKoST-C from TED Talks. In the process, we enhanced the sentence alignment approach using subtitle time information and bilingual sentence embedding information. The result is a 559-h English-Korean speech translation corpus. The proposed sentence alignment approach achieved an F-measure of 0.96. We also report the baseline performance of an English-Korean speech translation model trained with EnKoST-C. EnKoST-C is freely available on a Korean government open data hub site.

Implementation of CAD Data Translation System using STEP (STEP을 이용한 CAD 데이터 변환 시스템의 구현)

  • 이영준;고굉욱;유상봉
    • Korean Journal of Computational Design and Engineering / v.1 no.2 / pp.87-96 / 1996
  • IGES is a file format that has gained widespread use but has certain limitations, such as limited information coverage and ambiguous definitions. To overcome the limitations of existing neutral file formats, ISO has developed STEP as a more comprehensive mechanism for product data exchange. This paper describes a file translation system between IGES and STEP. In this system, three EXPRESS schemata are defined: one for IGES, one for STEP, and one for the translation relationship between them. Object code is generated from the schemata and linked with file access libraries for IGES and STEP files. The translation was verified by visualization and reverse translation. The system developed in this study can easily be applied to translate other file formats because the file structure and translation relationship are defined in EXPRESS, a high-level information modeling language.


A Study on the Performance Improvement of Machine Translation Using Public Korean-English Parallel Corpus (공공 한영 병렬 말뭉치를 이용한 기계번역 성능 향상 연구)

  • Park, Chanjun;Lim, Heuiseok
    • Journal of Digital Convergence / v.18 no.6 / pp.271-277 / 2020
  • Machine translation refers to software that translates a source language into a target language. Following rule-based and statistical machine translation, neural machine translation (NMT) is now being actively researched. One important factor in NMT is obtaining a high-quality parallel corpus, but such corpora have been hard to find for language pairs that include Korean. Recently, AI Hub of the National Information Society Agency (NIA) released a high-quality Korean-English parallel corpus of 1.6 million sentences. This paper attempts to verify the quality of each dataset by comparing the performance obtained with the data published by AI Hub and with OpenSubtitles, the most popular Korean-English parallel corpus. For objectivity, the test set published by IWSLT, an official test set for Korean-English machine translation, was used as test data. Experimental results show better performance than existing papers evaluated on the same test set, which demonstrates the importance of high-quality data.

Discriminative Models for Automatic Acquisition of Translation Equivalences

  • Zhang, Chun-Xiang;Li, Sheng;Zhao, Tie-Jun
    • International Journal of Control, Automation, and Systems / v.5 no.1 / pp.99-103 / 2007
  • Translation equivalence is very important for bilingual lexicography, machine translation systems, and cross-lingual information retrieval. Extracting equivalences from bilingual sentence pairs is a data mining problem. In this paper, discriminative learning methods are employed to filter translation equivalences. Discriminative features, including translation literality, phrase alignment probability, and phrase length ratio, are used to evaluate equivalences. A set of 1000 randomly selected equivalences is filtered and then evaluated. Experimental results indicate that the support vector machine achieves a precision of 87.8% and a recall of 89.8%.
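
The abstract reports precision and recall but no combined score. As a minimal sketch, the F-measure (the weighted harmonic mean of precision and recall) can be derived from the two reported figures. The function name `f_measure` and the `beta` parameter are illustrative choices, and the resulting value is computed here rather than reported by the authors.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when both scores are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall reported in the abstract for the support vector machine.
f1 = f_measure(0.878, 0.898)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.888"
```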

A Study on 『Korean Translation of ·』 -Focused on orthographic characteristics and characteristics as a variant version- (『국역본 <>·<>』 고찰 -표기적 특징과 이본적 성격을 중심으로-)

  • Kan, Ho-yun
    • Journal of Korean Classical Literature and Education / no.15 / pp.355-387 / 2008
  • The purpose of this study was to determine when the "Korean Translation of " was translated into Korean and copied, and to examine closely its characteristics as a variant version. The "Korean Translation" is a collection of Korean-translated romance and love stories discovered by Professor Kim Il Geun, and it is of considerable significance in the history of the novel as a 'Korean translation held by the royal court'. The conclusions of the study are as follows. (1) Judging from phonological phenomena, grammar, and vocabulary, it is presumed that the works were translated into Korean and copied between the reign of King Sukjong and the reign of King Yungjo of the Joseon Dynasty. This is because their orthography is intermediate between those of the 'Myeoknambon "Korean Translation of Taepyeonggwanggi(태평광기)"' and the 'NakseonJaebon(낙선재본)', copied between the middle of the 17th century and the middle of the 18th century, and because the reign of King Jeongjo of the Joseon Dynasty, which is set as the background period of the novels, should be excluded. Consequently, the "Korean Translation" confirms that the scope of the novel in the 17th and 18th centuries of Korean novel history had widened to include 'the royal court' and 'women'. (2) In terms of vocabulary, the "Korean Translation" is also significant as a collection translated in the royal court. It contains no new vocabulary, but some items, such as '(Traces:痕)', '(Clean eyes:明眸)', ' (Sail:帆)', '(Get up:起)', '글이플(Weak grass:弱草)', '쇼록(Owl:? 梟 or 鴉?)', '이 사라심(This life:此生)', and '노혀오매(Look for:訪)', are good data for the study of the Korean language. (3) The "Korean Translation" is valuable data on the translation and copying of court novels, and intentionally changed parts and partially omitted sentences appear rather in the than in the . A translated book differs from a copied book, and the second work shows both the intention of the translation and the instability of the copying. Therefore, unlike a simple copy, the "Korean Translation" has a double context in which one work is translated and a variant version is derived. (4) The "Korean Translation" is closely related to "Hangoldong(閒汨董)", but it is not based on the same copy. The source text of the "Korean Translation" is a variant of the same lineage as "Hangoldong" and "Jeochobon(저초본:정명기 소장본)" and is closer to "Hangoldong", but it is still not the same source copy. (5) Given that the "Korean Translation" has no distinct relation with the "Hangoldong", there is no correlation between the "Korean Translation" and and the "Hangoldong" and . In addition, we could not establish that the two share the same author.

Hindi Correspondence of Bengali Nominal Suffixes

  • Chatterji, Sanjay
    • Journal of Multimedia Information System / v.8 no.4 / pp.221-232 / 2021
  • One bottleneck of a Bengali-to-Hindi transfer-based machine translation system is the translation of noun suffixes. The appropriate translation of a nominal suffix often depends on the semantic role of the corresponding noun chunk in the sentence. With a high-performance Bengali morphological analyzer and a basic Bengali parser, it is possible to identify the role of each noun chunk. This information may be used to build rules for translating the ambiguous nominal suffixes. Because the uses of Bengali and Hindi nominal suffixes are similar in some respects, we find that the rules may be identified by linguistically analyzing corpus data. In this paper, we identify rules for four ambiguous Bengali nominal suffixes from corpus data and evaluate their performance. This set of rules is able to resolve a majority of the nominal suffix ambiguities in a Bengali-to-Hindi transfer-based machine translation system. Using the rules, we are able to translate 98.17% of Bengali nouns correctly, which is much better than the baseline ILMT system's accuracy of 62.8%.

Optimization of Data Augmentation Techniques in Neural Machine Translation (신경망 기계번역에서 최적화된 데이터 증강기법 고찰)

  • Park, Chanjun;Kim, Kuekyeng;Lim, Heuiseok
    • Annual Conference on Human and Language Technology / 2019.10a / pp.258-261 / 2019
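  • The emergence of deep-learning sequence-to-sequence models and of the Transformer, which uses multi-head attention, has brought great progress in machine translation. High-performing models such as the Transformer are trained on large parallel corpora, but building a large parallel corpus takes considerable time and money. To overcome this drawback, techniques for creating synthetic corpora are being studied, a representative example being Back Translation. Back Translation converts monolingual data into pseudo-parallel data and thereby increases the amount of training data; in other words, it is a kind of corpus expansion technique. Through various experiments using not only Back Translation but also Copied Translation, this paper examines the effect of data augmentation techniques on machine translation performance. The experimental results confirm that data augmentation techniques such as Back Translation and Copied Translation help improve machine translation performance, and that assigning relative weights when composing a batch also helps improve performance.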
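
The two augmentation schemes named in this abstract, Back Translation and Copied Translation, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `toy_reverse_model` is a hypothetical stand-in for a trained target-to-source model, and `sample_batch` with its `real_weight` parameter is only one possible reading of "relative weights when composing a batch" (here, the share of real pairs in each batch).

```python
import random
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(mono_target: List[str],
                   reverse_translate: Callable[[str], str]) -> List[Pair]:
    """Pair each monolingual target sentence with a machine-made source side."""
    return [(reverse_translate(t), t) for t in mono_target]


def copied_translation(mono_target: List[str]) -> List[Pair]:
    """Copy each monolingual target sentence to the source side unchanged."""
    return [(t, t) for t in mono_target]


def sample_batch(real: List[Pair], synthetic: List[Pair], batch_size: int,
                 real_weight: float = 0.75,
                 rng: Optional[random.Random] = None) -> List[Pair]:
    """Draw a batch in which roughly `real_weight` of the pairs are real."""
    rng = rng or random.Random(0)
    n_real = round(batch_size * real_weight)
    batch = rng.choices(real, k=n_real)
    batch += rng.choices(synthetic, k=batch_size - n_real)
    rng.shuffle(batch)
    return batch


# Toy stand-in for a trained target-to-source model (hypothetical).
toy_reverse_model = {"고맙습니다": "thank you", "안녕하세요": "hello"}

real_pairs = [("good morning", "좋은 아침입니다")]
mono = ["고맙습니다", "안녕하세요"]
synthetic_pairs = (back_translate(mono, toy_reverse_model.__getitem__)
                   + copied_translation(mono))
batch = sample_batch(real_pairs, synthetic_pairs, batch_size=4)
```

In practice the reverse model would be a translation model trained on the available real parallel data, and the mixing weight would be tuned empirically.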


Symbolizing Numbers to Improve Neural Machine Translation (숫자 기호화를 통한 신경기계번역 성능 향상)

  • Kang, Cheongwoong;Ro, Youngheon;Kim, Jisu;Choi, Heeyoul
    • Journal of Digital Contents Society / v.19 no.6 / pp.1161-1167 / 2018
  • The development of machine learning has enabled machines to perform delicate tasks that previously only humans could do, and many companies have therefore introduced machine-learning-based translators. Existing translators perform well but have problems translating numbers. They often mistranslate numbers when the input sentence includes a large number. Furthermore, the structure of the output sentence can change completely even if only one number in the input sentence changes. In this paper, we first optimized a neural machine translation model architecture that uses a bidirectional RNN, LSTM, and the attention mechanism, through data cleansing and changes to the dictionary size. We then implemented a number-processing algorithm specialized in number translation and applied it to the neural machine translation model to solve the problems above. The paper presents the data cleansing method, an optimal dictionary size, and the number-processing algorithm, as well as experimental results for translation performance based on the BLEU score.
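
The abstract does not describe the number-processing algorithm itself. One common way to symbolize numbers, shown here purely as a hedged sketch and not as the authors' method, is to replace each number with an indexed placeholder before translation and to restore the original digits afterwards. The names `symbolize` and `restore` and the `<numN>` token format are assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Digits with optional thousands separators or a decimal part, e.g. 1,234,567 or 3.14.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def symbolize(sentence: str) -> Tuple[str, List[str]]:
    """Replace every number with an indexed placeholder such as <num0>."""
    numbers: List[str] = []

    def repl(match: "re.Match[str]") -> str:
        numbers.append(match.group(0))
        return f"<num{len(numbers) - 1}>"

    return NUMBER.sub(repl, sentence), numbers


def restore(sentence: str, numbers: List[str]) -> str:
    """Put the original numbers back in place of the placeholders."""
    return re.sub(r"<num(\d+)>", lambda m: numbers[int(m.group(1))], sentence)


masked, found = symbolize("I paid 1,234,567 won for 3 books.")
# masked == "I paid <num0> won for <num1> books."
# found  == ["1,234,567", "3"]
```

Because the placeholders are indexed, the numbers can be restored correctly even when the translation reorders them, for example `restore("<num1> books cost <num0> won", found)`.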

Deep Learning-based Korean Dialect Machine Translation Research Considering Linguistics Features and Service (언어적 특성과 서비스를 고려한 딥러닝 기반 한국어 방언 기계번역 연구)

  • Lim, Sangbeom;Park, Chanjun;Yang, Yeongwook
    • Journal of the Korea Convergence Society / v.13 no.2 / pp.21-29 / 2022
  • Given the importance of dialect research, preservation, and communication, this paper studies machine translation of Korean dialects for dialect speakers who may be marginalized. The dialect data used are the AIHUB dialect data, which are distributed by highest-level administrative district. We propose a many-to-one dialect machine translation model that makes model distribution more efficient, and we apply a copy mechanism to improve the performance of dialect machine translation. This paper evaluates the one-to-one and many-to-one models using BLEU scores and analyzes the performance of the many-to-one model on Korean dialects from a linguistic perspective. Applying the proposed methodology improved the performance of one-to-one machine translation, and the many-to-one machine translation also achieved significantly high performance.

Design and Implementation of a Spatial Data Translation System (공간 데이타 변환 시스템의 설계 및 구현)

  • 이기영;노경택
    • Journal of the Korea Society of Computer and Information / v.8 no.4 / pp.41-46 / 2003
  • Recently, as geographic information has come to be used in a growing range of application fields, geographic information systems (GIS) have been built to use it efficiently. Each GIS, however, manages its data in its own format. To distribute geographic information efficiently, a data conversion system is needed that can convert GIS geographic information between incompatible formats. Therefore, in this paper, we design and implement a Spatial Data Translation System based on an international standard to convert geographic data efficiently.
