• Title/Summary/Keyword: Parallel corpus


The Parallel Corpus Approach to Building the Syntactic Tree Transfer Set in the English-to-Vietnamese Machine Translation

  • Dien Dinh;Ngan Thuy;Quang Xuan;Nam Chi
    • Proceedings of the IEEK Conference
    • /
    • summer
    • /
    • pp.382-386
    • /
    • 2004
  • Recently, following the machine learning trend, most machine translation systems around the world use syntactic tree sets of the two languages concerned to learn syntactic tree transfer rules. For the English-Vietnamese language pair, however, this approach is not feasible because no Vietnamese syntactic tree set corresponding to an English one exists yet. Building a very large corresponding Vietnamese syntactic tree set (thousands of trees) requires a great deal of work and the involvement of specialists in linguistics. To take advantage of our available English-Vietnamese Corpus (EVC), which is annotated with word alignments, we chose the SITG (Stochastic Inversion Transduction Grammar) model to construct English-Vietnamese syntactic tree sets automatically. This model parses the two languages at the same time and then carries out the syntactic tree transfer. The resulting English-Vietnamese bilingual syntactic tree set is the basic training data for automatically transferring English syntactic trees to Vietnamese ones with machine learning models. We tested the syntactic analysis on over 10,000 sentences out of the 500,000 sentences of our English-Vietnamese bilingual corpus, and the first stage gave encouraging results (about 80% analyzed) [5]. We have used the TBL (Transformation-Based Learning) algorithm to carry out automatic transformations from English syntactic trees to Vietnamese ones based on that parallel syntactic tree transfer set [6].

Cross-Lingual Style-Based Title Generation Using Multiple Adapters (다중 어댑터를 이용한 교차 언어 및 스타일 기반의 제목 생성)

  • Yo-Han Park;Yong-Seok Choi;Kong Joo Lee
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.12 no.8
    • /
    • pp.341-354
    • /
    • 2023
  • The title of a document is a brief summary of the document. Readers can understand a document more easily if its title is provided in their preferred style and language. In this research, we propose a cross-lingual and style-based title generation model using multiple adapters. Training such a model requires a parallel corpus in several languages with different styles. This kind of parallel corpus is quite difficult to construct; however, a monolingual title generation corpus of a single style can be built easily. Therefore, we apply a zero-shot strategy to generate a title in a different language and with a different style for an input document. The baseline model is a Transformer consisting of an encoder and a decoder, pre-trained on several languages. The model is then equipped with multiple adapters for translation, languages, and styles. After the model learns a translation task from a parallel corpus, it learns a title generation task from a monolingual title generation corpus. When training the model on a task, we activate only the adapter that corresponds to that task. When generating a cross-lingual and style-based title, we activate only the adapters that correspond to the target language and the target style. Experimental results show that the proposed model performs as well as a pipeline model that first translates into the target language and then generates a title. Natural language generation has changed significantly with the emergence of large-scale language models. However, research on improving natural language generation with limited resources and limited data needs to continue, and this study seeks to explore the significance of such research.

Expression of PAPP-A and $20{\alpha}$-HSD in the Bovine Corpus Luteum during Early Pregnancy (소의 초기 임신 황체에서 PAPP-A와 $20{\alpha}$-HSD의 발현 양상)

  • Kim, Dae-Seung;Kim, Sang-Hwan;Yoon, Jong-Taek
    • Journal of Embryo Transfer
    • /
    • v.26 no.1
    • /
    • pp.57-63
    • /
    • 2011
  • This study was performed to examine the expression of pregnancy-associated plasma protein-A (PAPP-A) and 20alpha-hydroxysteroid dehydrogenase ($20{\alpha}$-HSD) in the bovine corpus luteum during early pregnancy. To determine the function of the PAPP-A gene during early pregnancy, we collected bovine corpus luteum samples on days 30, 60, and 90 of pregnancy. The mRNA expression of the PAPP-A, $20{\alpha}$-HSD, progesterone receptor (PR), and insulin-like growth factor binding protein 4 (IGFBP4) genes was analyzed by real-time PCR. In parallel with the mRNA levels, the protein expression of PAPP-A and $20{\alpha}$-HSD was detected by immunological analysis. The mRNA expression of $20{\alpha}$-HSD and PAPP-A in the corpus luteum increased significantly on day 90 of pregnancy. The mRNA expression of PR and IGFBP4 in the corpus luteum increased progressively from day 30 to day 60, but decreased on day 90 of pregnancy. The expression patterns of PAPP-A and $20{\alpha}$-HSD were similar in these tissues. In conclusion, PAPP-A and $20{\alpha}$-HSD activity in the corpus luteum could play a role in the manifestation of early pregnancy.

Extracting Korean-English Parallel Sentences from Wikipedia (위키피디아로부터 한국어-영어 병렬 문장 추출)

  • Kim, Sung-Hyun;Yang, Seon;Ko, Youngjoong
    • Journal of KIISE:Software and Applications
    • /
    • v.41 no.8
    • /
    • pp.580-585
    • /
    • 2014
  • This paper reports a variety of experiments on extracting Korean-English parallel sentences from Wikipedia data. We draw on various methods that were previously proposed for other languages and use two approaches. The first uses translation probabilities extracted from existing resources such as the Sejong parallel corpus, and the second uses dictionaries such as a Wiki dictionary consisting of Wikipedia titles and MRDs (machine-readable dictionaries). Experimental results show a significant improvement for the system using Wikipedia data over the one using only the existing resources. We finally achieve an outstanding performance, an F1-score of 57.6%. We additionally conduct experiments using a topic model. Although this experiment shows a relatively lower performance, an F1-score of 51.6%, it is expected to be worthy of further study.
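
The dictionary-based approach and the F1 evaluation mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' system: the function names, the overlap-ratio scoring rule, and the dictionary layout (a Korean token mapped to a set of lowercase English translations) are all assumptions made for illustration.

```python
def dictionary_overlap(ko_tokens, en_tokens, bilingual_dict):
    """Fraction of Korean tokens that have at least one dictionary
    translation present in the English sentence (hypothetical scoring rule)."""
    if not ko_tokens:
        return 0.0
    en_set = {tok.lower() for tok in en_tokens}
    hits = sum(1 for tok in ko_tokens if bilingual_dict.get(tok, set()) & en_set)
    return hits / len(ko_tokens)


def f1_score(predicted_pairs, gold_pairs):
    """F1 over extracted sentence pairs, compared against a gold set of pairs."""
    predicted, gold = set(predicted_pairs), set(gold_pairs)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A candidate pair would be kept when `dictionary_overlap` exceeds some tuned threshold; the threshold itself is not given in the abstract and would have to be chosen on held-out data.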

Automatically Extracting Unknown Translations Using Phrase Alignment (정렬기법을 이용한 미등록 대역어의 자동 추출)

  • Kim, Jae-Hoon;Yang, Sung-Il
    • The KIPS Transactions:PartB
    • /
    • v.14B no.3 s.113
    • /
    • pp.231-240
    • /
    • 2007
  • In this paper, we propose an automatic extraction model for unknown translations and implement an unknown translation extraction system using the proposed model. The proposed model is a phrase-alignment model that combines three models: a phrase-boundary model, a language model, and a translation model. The system built on this model consists of three parts: construction of parallel corpora, alignment of Korean and English words, and extraction of unknown translations. To evaluate the performance of the proposed system, we established a reference corpus for extracting unknown translations, which comprises 2,220 parallel sentences including about 1,500 unknown translations. Through several experiments, we observed that the proposed model is very useful for extracting unknown translations. In the future, research on objective evaluation and on building good-quality parallel corpora should be carried out, and work on improving the performance of unknown translation extraction should continue.

Performance Improvement of Bilingual Lexicon Extraction via Pivot Language and Word Alignment Tool (중간언어와 단어정렬을 통한 이중언어 사전의 자동 추출에 대한 성능 개선)

  • Kwon, Hong-Seok;Seo, Hyeung-Won;Kim, Jae-Hoon
    • Annual Conference on Human and Language Technology
    • /
    • 2013.10a
    • /
    • pp.27-32
    • /
    • 2013
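  • This paper proposes a method for automatically extracting bilingual lexicons from parallel corpora for less-known language pairs. The method uses a pivot language as an intermediary and uses Anymalign, a publicly available word alignment tool, to generate context vectors. As a result, it needs no step that translates context vectors with a seed dictionary, and it improves translation accuracy for low-frequency words, which is a weakness of statistical methods. In addition, using bidirectional translation probabilities above a certain threshold as the element values of the context vectors greatly improves the top-5 translation accuracy. We conducted bilingual lexicon extraction experiments in both directions for two different language pairs, Korean-Spanish and Korean-French. Compared with the experimental results of previous work, translation accuracy improved by at least 3.41% and at most 67.91% for high-frequency words, and by at least 5.06% and at most 990% for low-frequency words.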

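
The pivot-language idea in the abstract above can be sketched in Python as follows. This is an illustrative sketch only: it does not call Anymalign, the input probabilities are assumed to be given, and combining the two directional probabilities by multiplication is an assumption, since the abstract does not say how they are combined. All function names are hypothetical.

```python
import math


def build_context_vector(pivot_probs, threshold=0.1):
    """Build a context vector over pivot-language words.

    pivot_probs maps a pivot word to (forward, backward) translation
    probabilities. Entries are kept only when both reach the threshold;
    the element value is their product (an assumption of this sketch).
    """
    return {
        pivot: forward * backward
        for pivot, (forward, backward) in pivot_probs.items()
        if forward >= threshold and backward >= threshold
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_translations(source_vector, target_vectors, k=5):
    """Rank target-language words by similarity of their pivot context vectors."""
    scored = [(word, cosine(source_vector, vec)) for word, vec in target_vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

Because both the source word and the target candidates are described in the same pivot vocabulary, their vectors can be compared directly, which is why no seed dictionary is needed to translate the vectors.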

The Use of MSVM and HMM for Sentence Alignment

  • Fattah, Mohamed Abdel
    • Journal of Information Processing Systems
    • /
    • v.8 no.2
    • /
    • pp.301-314
    • /
    • 2012
  • In this paper, two new approaches to aligning English-Arabic sentences in bilingual parallel corpora are presented, based on Multi-Class Support Vector Machine (MSVM) and Hidden Markov Model (HMM) classifiers. A feature vector is extracted from the text pair under consideration. This vector contains text features such as length, punctuation score, and cognate score values. A set of manually prepared training data was used to train the Multi-Class Support Vector Machine and the Hidden Markov Model. Another set of data was used for testing. The MSVM and HMM results outperform those of the length-based approach. Moreover, these new approaches are valid for any language pair and are quite flexible, since the feature vector may contain fewer, more, or different features than the ones used in the current research, such as a lexical matching feature or Hanzi characters in Japanese-Chinese texts.
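
A feature vector of the kind this abstract describes (length, punctuation score, cognate score) might be computed as in the Python sketch below. The exact definitions used in the paper are not given in the abstract, so each formula here is an assumption: length is a character-length ratio, the punctuation score is the overlap of punctuation counts after mapping Arabic marks to Latin ones, and "cognates" are approximated by tokens that appear unchanged in both sentences, such as numbers.

```python
from collections import Counter
import string

# Arabic comma, semicolon, and question mark mapped to Latin equivalents.
ARABIC_TO_LATIN = {"\u060c": ",", "\u061b": ";", "\u061f": "?"}
MARKS = set(".,;:!?")
STRIP_CHARS = string.punctuation + "".join(ARABIC_TO_LATIN)


def length_ratio(src, tgt):
    """Shorter character length divided by longer one (1.0 for two empty strings)."""
    longer = max(len(src), len(tgt))
    return min(len(src), len(tgt)) / longer if longer else 1.0


def _mark_counts(text):
    normalized = (ARABIC_TO_LATIN.get(ch, ch) for ch in text)
    return Counter(ch for ch in normalized if ch in MARKS)


def punctuation_score(src, tgt):
    """Shared punctuation marks divided by the larger punctuation count."""
    a, b = _mark_counts(src), _mark_counts(tgt)
    total = max(sum(a.values()), sum(b.values()))
    if total == 0:
        return 1.0
    return sum((a & b).values()) / total  # Counter & keeps the minimum counts


def _tokens(text):
    return {tok.strip(STRIP_CHARS).lower() for tok in text.split()} - {""}


def cognate_score(src, tgt):
    """Tokens identical in both sentences, relative to the smaller token set."""
    a, b = _tokens(src), _tokens(tgt)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def feature_vector(src, tgt):
    """Features for one candidate sentence pair, ready to feed to a classifier."""
    return [length_ratio(src, tgt), punctuation_score(src, tgt), cognate_score(src, tgt)]
```

Vectors like these, labeled as aligned or not aligned, would form the training data for a classifier; the classifier itself is outside the scope of this sketch.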

Chunking Korean and an Application (한국어 낱말 묶기와 그 응용)

  • Un Koaunghi;Hong Jungha;You Seok-Hoon;Lee Kiyong;Choe Jae-Woong
    • Language and Information
    • /
    • v.9 no.2
    • /
    • pp.49-68
    • /
    • 2005
  • Application of chunking to English and some other European languages has shown that it is a viable parsing mechanism for natural languages. Although a small number of attempts have been made to apply chunking to the analysis of the Korean language, it is still not clear what criteria should be used to identify appropriate units of chunking, or how efficient and valid chunking algorithms would be when applied to authentic Korean texts. The purpose of this research is to provide an alternative set of algorithms for chunking Korean, to implement them, and to test them against an English-Korean parallel corpus, namely English and Korean Bibles matched sentence by sentence. The paper shows that aligning related texts and identifying matched phrases between the two languages can be achieved through appropriate chunking and matching algorithms defined on the morphologically tagged parallel corpus. The chunking and matching processes are based on content words rather than function words, and the matching itself is done by means of a transfer dictionary. The implementation is done in C and XML, and can be accessed through the Internet.

Sacral Insufficiency Fractures : How to Classify?

  • Bakker, Gesa;Hattingen, Joerg;Stuetzer, Hartmut;Isenberg, Joerg
    • Journal of Korean Neurosurgical Society
    • /
    • v.61 no.2
    • /
    • pp.258-266
    • /
    • 2018
  • Objective : The number of insufficiency fractures of the sacrum diagnosed in the elderly population increases annually. These fractures show very different morphologies. We aimed to classify sacral insufficiency fractures according to the position of the cortical break and the possible need for intervention. Methods : Between January 1, 2008 and December 31, 2014, all patients with a proven fracture of the sacrum following a low-energy or even unnoticed trauma were prospectively registered : 117 females and 13 males. All patients underwent computed tomography of the pelvic ring, and two patients additionally underwent magnetic resonance imaging : the localization of the fracture lines and their involvement of the sacroiliac joint, the neural foramina, or the spinal canal were identified. Results : Patients were aged between 46 and 98 years (mean, 79.8 years). Seventy-seven patients had a unilateral fracture of the sacral ala, 41 had bilateral ala fractures, and 12 patients showed a fracture of the sacral corpus : a total of 171 fractures were analyzed. Group A included fractures of the sacral ala that were assessed to have little or no mechanical importance (n=53) : fractures with no cortical disruption ("bone bruise") (A1; n=2), cortical deformation of the anterior cortical bone (A2; n=4), and fracture of the anterolateral rim of the ala (A3; n=47). Group B comprised complete fractures of the sacral ala (B; n=106) : parallel to the sacroiliac joint (B1; n=63), into the sacroiliac joint (B2; n=19), and with involvement of the sacral foramina or the spinal canal (B3; n=24). Group C comprised central fractures involving the sacral corpus (C; n=12) : fracture limited to the corpus or ending in one ala (C1; n=3), unidirectional including the neural foramina or the spinal canal or both (C2; n=2), and horizontal fractures of the corpus with bilateral sagittal completion (C3; n=8). Sixty-eight fractures extended into the sacroiliac joint, and 34 fractures showed an injury of the foramina or the canal.
Conclusion : The new classification allows fractures of lesser mechanical importance to be differentiated and supports a risk assessment for possible polymethyl methacrylate leaks toward the neurological structures during sacroplasty. In addition, it makes it possible to identify unstable fractures in need of laminectomy and surgical stabilization.