• 제목/요약/키워드: bilingual text

검색결과 18건 처리시간 0.025초

연결요소를 이용한 한.영 혼용문서의 구조분석 및 낱자분리 (Bilingual document analysis and character segmentation using connected components)

  • 김민기;권영빈;한상용
    • 한국통신학회논문지
    • /
    • 제22권3호
    • /
    • pp.410-422
    • /
    • 1997
  • In this paper, we descried a bottom-up document structure analysis method in bilingual Korean-English document. We proposed a character segmentation method based on the layout information of connected component of each character. In many researches, a document has been analyzed into text blocks and graphics. We analyzed a document into four parts: text, table, graphic, and separator. A text is recursively subdivided into text blocks, text lines, words, and characters. To extract the character in bilingual text, we proposed a new method of word of word separation of Korean or English. Futhermore, we used a character merging and segmentation method in accordance with the properties of Hangul on the Korean word blocks. Experimental results on the various documents show that the proposed method is very effectively operated on the document structure analysis and the character segmentation.

  • PDF

감정점수의 전파를 통한 한국어 감정사전 생성 (Generating a Korean Sentiment Lexicon Through Sentiment Score Propagation)

  • 박호민;김창현;김재훈
    • 정보처리학회논문지:소프트웨어 및 데이터공학
    • /
    • 제9권2호
    • /
    • pp.53-60
    • /
    • 2020
  • 감정분석은 문서 또는 대화상에서 주어진 주제에 대한 태도와 의견을 이해하는 과정이다. 감정분석에는 다양한 접근법이 있다. 그 중 하나는 감정사전을 이용하는 사전 기반 접근법이다. 본 논문에서는 널리 알려진 영어 감정사전인 VADER를 활용하여 한국어 감정사전을 자동으로 생성하는 방법을 제안한다. 제안된 방법은 세 단계로 구성된다. 첫 번째 단계는 한영 병렬 말뭉치를 사용하여 한영 이중언어 사전을 제작한다. 제작된 이중언어 사전은 VADER 감정어와 한국어 형태소 쌍들의 집합이다. 두 번째 단계는 그 이중언어 사전을 사용하여 한영 단어 그래프를 생성한다. 세 번째 단계는 생성된 단어 그래프 상에서 레이블 전파 알고리즘을 실행하여 새로운 감정사전을 구축한다. 이와 같은 과정으로 생성된 한국어 감정사전을 유용성을 보이려고 몇 가지 실험을 수행하였다. 본 논문에서 생성된 감정사전을 이용한 감정 분류기가 기존의 기계학습 기반 감정분류기보다 좋은 성능을 보였다. 앞으로 본 논문에서 제안된 방법을 적용하여 여러 언어의 감정사전을 생성하려고 한다.

English-Korean speech translation corpus (EnKoST-C): Construction procedure and evaluation results

  • Jeong-Uk Bang;Joon-Gyu Maeng;Jun Park;Seung Yun;Sang-Hun Kim
    • ETRI Journal
    • /
    • 제45권1호
    • /
    • pp.18-27
    • /
    • 2023
  • We present an English-Korean speech translation corpus, named EnKoST-C. End-to-end model training for speech translation tasks often suffers from a lack of parallel data, such as speech data in the source language and equivalent text data in the target language. Most available public speech translation corpora were developed for European languages, and there is currently no public corpus for English-Korean end-to-end speech translation. Thus, we created an EnKoST-C centered on TED Talks. In this process, we enhance the sentence alignment approach using the subtitle time information and bilingual sentence embedding information. As a result, we built a 559-h English-Korean speech translation corpus. The proposed sentence alignment approach showed excellent performance of 0.96 f-measure score. We also show the baseline performance of an English-Korean speech translation model trained with EnKoST-C. The EnKoST-C is freely available on a Korean government open data hub site.

Ranking Translation Word Selection Using a Bilingual Dictionary and WordNet

  • Kim, Kweon-Yang;Park, Se-Young
    • 한국지능시스템학회논문지
    • /
    • 제16권1호
    • /
    • pp.124-129
    • /
    • 2006
  • This parer presents a method of ranking translation word selection for Korean verbs based on lexical knowledge contained in a bilingual Korean-English dictionary and WordNet that are easily obtainable knowledge resources. We focus on deciding which translation of the target word is the most appropriate using the measure of semantic relatedness through the 45 extended relations between possible translations of target word and some indicative clue words that play a role of predicate-arguments in source language text. In order to reduce the weight of application of possibly unwanted senses, we rank the possible word senses for each translation word by measuring semantic similarity between the translation word and its near synonyms. We report an average accuracy of $51\%$ with ten Korean ambiguous verbs. The evaluation suggests that our approach outperforms the default baseline performance and previous works.

Computerized Sound Dictionary of Korean and English

  • Kim, Jong-Mi
    • 음성과학
    • /
    • 제8권1호
    • /
    • pp.33-52
    • /
    • 2001
  • A bilingual sound dictionary in Korean and English has been created for a broad range of sound reference to cross-linguistic, dialectal, native language (L1)-transferred biological and allophonic variations. The paper demonstrates that the pronunciation dictionary of the lexicon is inadequate for sound reference due to the preponderance of unmarked sounds. The audio registry consists of the three-way comparison of 1) English speech from native English speakers, 2) Korean speech from Korean speakers, and 3) English speech from Korean speakers. Several sub-dictionaries have been created as the foundation research for independent development. They are 1) a pronunciation dictionary of the Korean lexicon in a keyboard-compatible phonetic transcription, 2) a sound dictionary of L1-interfered language, and 3) an audible dictionary of Korean sounds. The dictionary was designed to facilitate the exchange of the speech signal and its corresponding text data on various media particularly on CD-ROM. The methodology and findings of the construction are discussed.

  • PDF

수어 동작 키포인트 중심의 시공간적 정보를 강화한 Sign2Gloss2Text 기반의 수어 번역 (Sign2Gloss2Text-based Sign Language Translation with Enhanced Spatial-temporal Information Centered on Sign Language Movement Keypoints)

  • 김민채;김정은;김하영
    • 한국멀티미디어학회논문지
    • /
    • 제25권10호
    • /
    • pp.1535-1545
    • /
    • 2022
  • Sign language has completely different meaning depending on the direction of the hand or the change of facial expression even with the same gesture. In this respect, it is crucial to capture the spatial-temporal structure information of each movement. However, sign language translation studies based on Sign2Gloss2Text only convey comprehensive spatial-temporal information about the entire sign language movement. Consequently, detailed information (facial expression, gestures, and etc.) of each movement that is important for sign language translation is not emphasized. Accordingly, in this paper, we propose Spatial-temporal Keypoints Centered Sign2Gloss2Text Translation, named STKC-Sign2 Gloss2Text, to supplement the sequential and semantic information of keypoints which are the core of recognizing and translating sign language. STKC-Sign2Gloss2Text consists of two steps, Spatial Keypoints Embedding, which extracts 121 major keypoints from each image, and Temporal Keypoints Embedding, which emphasizes sequential information using Bi-GRU for extracted keypoints of sign language. The proposed model outperformed all Bilingual Evaluation Understudy(BLEU) scores in Development(DEV) and Testing(TEST) than Sign2Gloss2Text as the baseline, and in particular, it proved the effectiveness of the proposed methodology by achieving 23.19, an improvement of 1.87 based on TEST BLEU-4.

이중언어능력의 조선족 아동과 청소년의 한글, 한자, 한글.한자혼합문 형태의 덩이글 이해에 관한 연구 (A Study on the Comprehension of Texts with Korean Hangul, Chinese Hanja and Hangul.Hanja among Korean-Chinese children and adolescents)

  • 윤혜경;박혜원;권오식
    • 아동학회지
    • /
    • 제30권2호
    • /
    • pp.15-28
    • /
    • 2009
  • This study focused on the comprehension of texts written either in Korean script (Hangul) or Chinese script (Hanja). For this purpose, we measured the reading time and the correct response in text comprehension tasks with 104 Korean-Chinese children who were either 10 or 19 years old. There was a main effect of script : The reading time of Hanja texts was shorter than that of Hangul or Hangul Hanja mixed texts. But the older subjects who spent the same reading time in both Hangul and Hanja texts showed the longer reading time in Hangul Hanja mixed texts revealing the interaction between age and script. The correct response rate on the comprehension task was the highest in Hangul text. The results were discussed in relation to the independent dual language processing systems in Korean-Chinese.

  • PDF

The Use of MSVM and HMM for Sentence Alignment

  • Fattah, Mohamed Abdel
    • Journal of Information Processing Systems
    • /
    • 제8권2호
    • /
    • pp.301-314
    • /
    • 2012
  • In this paper, two new approaches to align English-Arabic sentences in bilingual parallel corpora based on the Multi-Class Support Vector Machine (MSVM) and the Hidden Markov Model (HMM) classifiers are presented. A feature vector is extracted from the text pair that is under consideration. This vector contains text features such as length, punctuation score, and cognate score values. A set of manually prepared training data was assigned to train the Multi-Class Support Vector Machine and Hidden Markov Model. Another set of data was used for testing. The results of the MSVM and HMM outperform the results of the length based approach. Moreover these new approaches are valid for any language pairs and are quite flexible since the feature vector may contain less, more, or different features, such as a lexical matching feature and Hanzi characters in Japanese-Chinese texts, than the ones used in the current research.