• 제목/요약/키워드: Syllable-based Part-of-Speech Tagging

검색결과 4건 처리시간 0.019초

딥러닝을 이용한 한국어 Head-Tail 토큰화 기법과 품사 태깅 (Korean Head-Tail Tokenization and Part-of-Speech Tagging by using Deep Learning)

  • 김정민;강승식;김혁만
    • 대한임베디드공학회논문지
    • /
    • 제17권4호
    • /
    • pp.199-208
    • /
    • 2022
  • Korean is an agglutinative language, and one or more morphemes are combined to form a single word. Part-of-speech tagging method separates each morpheme from a word and attaches a part-of-speech tag. In this study, we propose a new Korean part-of-speech tagging method based on the Head-Tail tokenization technique that divides a word into a lexical morpheme part and a grammatical morpheme part without decomposing compound words. In this method, the Head-Tail is divided by the syllable boundary without restoring irregular deformation or abbreviated syllables. Korean part-of-speech tagger was implemented using the Head-Tail tokenization and deep learning technique. In order to solve the problem that a large number of complex tags are generated due to the segmented tags and the tagging accuracy is low, we reduced the number of tags to a complex tag composed of large classification tags, and as a result, we improved the tagging accuracy. The performance of the Head-Tail part-of-speech tagger was experimented by using BERT, syllable bigram, and subword bigram embedding, and both syllable bigram and subword bigram embedding showed improvement in performance compared to general BERT. Part-of-speech tagging was performed by integrating the Head-Tail tokenization model and the simplified part-of-speech tagging model, achieving 98.99% word unit accuracy and 99.08% token unit accuracy. As a result of the experiment, it was found that the performance of part-of-speech tagging improved when the maximum token length was limited to twice the number of words.

형태소 분석기 사용을 배제한 음절 단위의 한국어 품사 태깅 (Syllable-based POS Tagging without Korean Morphological Analysis)

  • 심광섭
    • 인지과학
    • /
    • 제22권3호
    • /
    • pp.327-345
    • /
    • 2011
  • 본 논문에서는 형태소 분석기를 사용하지 않는 음절 단위의 한국어 품사 태깅 방법론을 제안한다. 기존 연구에서 한국어 품사 태거는 형태소 분석기가 생성한 결과 중에서 문맥에 가장 잘 맞는 형태소/품사 열을 결정하는 데 반하여, 본 논문에서 제안한 방법론에서는 품사열을 결정할 뿐만 아니라 형태소도 생성한다. 398,632 어절의 학습 데이터로 학습을 하고 33,467 어절의 평가 데이터로 성능 평가를 한 결과 어절 단위의 정확도가 96.31%인 것으로 나타났다.

  • PDF

코퍼스 기반 한국어 합성기의 억양 구현 방안 (A Method of Intonation Modeling for Corpus-Based Korean Speech Synthesizer)

  • 김진영;박상언;엄기완;최승호
    • 음성과학
    • /
    • 제7권2호
    • /
    • pp.193-208
    • /
    • 2000
  • This paper describes a multi-step method of intonation modeling for corpus-based Korean speech synthesizer. We selected 1833 sentences considering various syntactic structures and built a corresponding speech corpus uttered by a female announcer. We detected the pitch using laryngograph signals and manually marked the prosodic boundaries on recorded speech, and carried out the tagging of part-of-speech and syntactic analysis on the text. The detected pitch was separated into 3 frequency bands of low, mid, high frequency components which correspond to the baseline, the word tone, and the syllable tone. We predicted them using the CART method and the Viterbi search algorithm with a word-tone-dictionary. In the collected spoken sentences, 1500 sentences were trained and 333 sentences were tested. In the layer of word tone modeling, we compared two methods. One is to predict the word tone corresponding to the mid-frequency components directly and the other is to predict it by multiplying the ratio of the word tone to the baseline by the baseline. The former method resulted in a mean error of 12.37 Hz and the latter in one of 12.41 Hz, similar to each other. In the layer of syllable tone modeling, it resulted in a mean error rate less than 8.3% comparing with the mean pitch, 193.56 Hz of the announcer, so its performance was relatively good.

  • PDF

NB 모델을 이용한 형태소 복원 (Morpheme Recovery Based on Naïve Bayes Model)

  • 김재훈;전길호
    • 정보처리학회논문지B
    • /
    • 제19B권3호
    • /
    • pp.195-200
    • /
    • 2012
  • 한국어는 교착어이어서 형태소 분석 없이 품사 부착이 어려울 뿐 아니라 형태소를 분석할 때 다양한 어형 변화가 복원되어야 한다. 이것은 한국어 형태소 분석의 고질적인 문제 중 하나이며, 주로 규칙을 이용해서 해결한다. 규칙을 이용할 경우 주어진 문맥에 가장 적합한 복원을 어려워 여러 형태의 모호성을 생성하며, 이는 품사 부착에 의해서 해결된다. 본 논문에서는 이 문제를 기계학습 방법(Na$\ddot{i}$ve Bayes 모델)을 이용하여 해결한다. 기계학습 모델의 입력 자질은 어형 변화가 발생하는 주변 음절이며 출력 범주는 복원된 음절이다. ETRI 구문 말뭉치를 이용한 실험에서 제안된 형태소 복원 모델을 사용한 형태소 단위의 품사 부착 성능은 97.5%의 $F_1$점수를 보였으며 이 모델이 형태소 복원에 매우 유용함을 알 수 있었다.