• Title/Summary/Keyword: POS-Tagging

Sequence-to-sequence based Morphological Analysis and Part-Of-Speech Tagging for Korean Language with Convolutional Features (Sequence-to-sequence 기반 한국어 형태소 분석 및 품사 태깅)

  • Li, Jianri;Lee, EuiHyeon;Lee, Jong-Hyeok
    • Journal of KIISE / v.44 no.1 / pp.57-62 / 2017
  • Traditional Korean morphological analysis and POS tagging methods usually consist of two steps: (1) generate hypotheses of all possible combinations of morphemes for the given input, and (2) perform POS tagging to search for the optimal result. These methods require additional resources such as dictionaries, and errors in the first step can propagate to the second step. In this paper, we tried to solve this problem in an end-to-end fashion using a sequence-to-sequence model with convolutional features. Experimental results on the Sejong corpus show that our approach achieved 97.15% F1-score at the morpheme level and 95.33% and 60.62% precision at the word and sentence levels, respectively; a second reported result is 96.91% F1-score at the morpheme level and 95.40% and 60.62% precision at the word and sentence levels, respectively.

A Survey of Machine Translation and Parts of Speech Tagging for Indian Languages

  • Khedkar, Vijayshri;Shah, Pritesh
    • International Journal of Computer Science & Network Security / v.22 no.4 / pp.245-253 / 2022
  • Commenced in 1954 by IBM, machine translation has expanded immensely, particularly in this period. Machine translation can be broken into seven main steps, namely token generation, morphological analysis, lexeme, part-of-speech tagging, chunking, parsing, and word disambiguation. Morphological analysis plays a major role when translating Indian languages to develop accurate part-of-speech taggers and word sense. The paper presents various machine translation methods used by different researchers for Indian languages, along with their performance and drawbacks. Further, the paper concentrates on part-of-speech (POS) tagging in Marathi using various methods such as rule-based tagging, unigram, bigram, and more. After careful study, it is concluded that part-of-speech tagging is a major step in machine translation. Also, for the Marathi language, the Hidden Markov Model gives the best results for part-of-speech tagging, with an accuracy of 93%, which can be further improved depending on the dataset.

A knowledge-based pronunciation generation system for French (지식 기반 프랑스어 발음열 생성 시스템)

  • Kim, Sunhee
    • Phonetics and Speech Sciences / v.10 no.1 / pp.49-55 / 2018
  • This paper aims to describe a knowledge-based pronunciation generation system for French. It has been reported that a rule-based pronunciation generation system outperforms most of the data-driven ones for French; however, only a few related studies are available due to existing language barriers. We provide basic information about the French language from the point of view of the relationship between orthography and pronunciation, and then describe our knowledge-based pronunciation generation system, which consists of morphological analysis, Part-of-Speech (POS) tagging, grapheme-to-phoneme generation, and phone-to-phone generation. The evaluation results show that the word error rate of POS tagging, based on a sample of 1,000 sentences, is 10.70% and that of phoneme generation, using 130,883 entries, is 2.70%. This study is expected to contribute to the development and evaluation of speech synthesis or speech recognition systems for French.

Measuring Reliability of POS Tagging Systems (품사 태깅 시스템의 신뢰도 측정)

  • Kim, Jae-Hun
    • The KIPS Transactions:PartB / v.8B no.4 / pp.365-372 / 2001
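  • This paper describes a method for measuring the reliability of a POS tagging system. The reliability of a POS tagging system is the probability that its tagging result contains no errors. Reliability measurement is generally based on the error probability. Estimating the error probability accurately usually requires a much larger corpus than the one used by the POS tagging system itself. To alleviate this problem somewhat, this paper uses cross-validation to estimate the error probability more accurately. The POS tagging system used in this paper showed a reliability of 61% on the test corpus. This corresponds to the reliability obtained when a Korean sentence contains 20 morphemes on average and the accuracy of the POS tagging system is 97.5%. Since the POS tagging system used in this paper achieves an accuracy of 97.68% when there are no unknown words, the proposed reliability measure appears to be reasonably valid. The proposed method can be applied to many areas, such as parsing and information retrieval; in this paper, it was applied to error detection in POS tagging.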
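
The link this abstract draws between a sentence-level reliability of about 61%, a per-morpheme accuracy of 97.5%, and an average of 20 morphemes per sentence can be checked with a short calculation. The sketch below assumes, as a simplification that the abstract does not state, that tagging errors on individual morphemes are independent, so the probability of an error-free sentence is the accuracy raised to the number of morphemes. The function name is hypothetical.

```python
def sentence_reliability(accuracy: float, n_morphemes: int) -> float:
    """Probability that every morpheme in a sentence is tagged correctly,
    ASSUMING independent per-morpheme errors (a simplification)."""
    return accuracy ** n_morphemes


# Figures quoted in the abstract: 97.5% accuracy, 20 morphemes per sentence.
reliability = sentence_reliability(0.975, 20)
print(f"{reliability:.3f}")
```

Under this assumption the result is roughly 0.60, which is close to the 61% reported in the abstract.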

Chinese Segmentation and POS-Tagging by Automatic POS Dictionary Training (품사 사전 자동 학습을 통한 중국어 단어 분할 및 품사 태깅)

  • Ha, Ju-Hong;Zheng, Yu;Lee, Gary G.
    • Annual Conference on Human and Language Technology / 2002.10e / pp.33-39 / 2002
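  • Because Chinese sentences have no explicit separation between words, Chinese part-of-speech (POS) tagging requires word segmentation and POS tagging to be handled together. This paper segments input sentences into words using a word segmentation system that combines rule-based and dictionary-based techniques, and then applies a statistical POS tagging technique based on an HMM (hidden Markov model). In particular, the POS dictionary is built through automatic training from a given corpus, which keeps the implemented system independent of the corpus. The corpora cover both Simplified and Traditional Chinese, and word segmentation and POS tagging are performed with the POS dictionary obtained by automatic training from each corpus. The experimental results show the word segmentation performance and the POS tagging performance for Simplified and Traditional Chinese, respectively.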
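
The idea of learning a POS dictionary from a tagged corpus and then using it for dictionary-based segmentation can be illustrated with a small sketch. It is not the system described in this abstract: forward maximum matching is a common dictionary-based technique swapped in here as an assumption, the rule-based component is omitted, and the function names, the tiny corpus, and its tags are hypothetical.

```python
from collections import Counter, defaultdict


def train_pos_dictionary(tagged_sentences):
    """For every word in the tagged corpus, count how often each tag was seen."""
    dictionary = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            dictionary[word][tag] += 1
    return dictionary


def segment(text, dictionary):
    """Forward maximum matching: take the longest dictionary word at each position."""
    max_len = max((len(word) for word in dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Fall back to a single character when no dictionary word matches.
            if candidate in dictionary or length == 1:
                words.append(candidate)
                i += length
                break
    return words


# Toy tagged corpus (hypothetical words and tags, for illustration only).
corpus = [
    [("我", "PN"), ("喜欢", "VV"), ("北京", "NR")],
    [("北京", "NR"), ("大学", "NN")],
]
pos_dict = train_pos_dictionary(corpus)
print(segment("我喜欢北京大学", pos_dict))
```

Because the dictionary is rebuilt from whichever corpus is supplied, the same code can be run separately on a Simplified and a Traditional Chinese corpus, which mirrors the corpus independence the abstract emphasizes.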

Performance Comparison Analysis on Named Entity Recognition system with Bi-LSTM based Multi-task Learning (다중작업학습 기법을 적용한 Bi-LSTM 개체명 인식 시스템 성능 비교 분석)

  • Kim, GyeongMin;Han, Seunggnyu;Oh, Dongsuk;Lim, HeuiSeok
    • Journal of Digital Convergence / v.17 no.12 / pp.243-248 / 2019
  • Multi-task learning (MTL) is a training method in which a single neural network is trained on multiple tasks that influence each other. In this paper, we compare the performance of an MTL named entity recognition (NER) model trained on a Korean traditional culture corpus with that of other NER models. During training, the Bi-LSTM layers for part-of-speech tagging (POS tagging) and NER are propagated from a Bi-LSTM layer to obtain the joint loss. As a result, the MTL-based Bi-LSTM model shows a performance improvement of 1.1% to 4.6% over single Bi-LSTM models.

A Semi-supervised Learning of HMM to Build a POS Tagger for a Low Resourced Language

  • Pattnaik, Sagarika;Nayak, Ajit Kumar;Patnaik, Srikanta
    • Journal of information and communication convergence engineering / v.18 no.4 / pp.207-215 / 2020
  • Part-of-speech (POS) tagging is an indispensable part of major NLP models. Progress has been made for a number of languages around the globe, especially European languages. For Indian languages, however, there has been no major breakthrough due to a lack of supporting tools and resources, and the Odia language in particular has yet to make its mark. With the aim of making Odia usable in different NLP operations, this paper attempts to develop a POS tagger for the language on an HMM (Hidden Markov Model) platform. The tagger uses a bigram HMM with the dynamic Viterbi algorithm to output annotated text with maximum accuracy. The model is tested on a corpus from the tourism domain amounting to approximately 0.2 million tokens. With a training-to-testing ratio of 3:1, the proposed model gives satisfactory results despite the limited training size.
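
The bigram HMM with Viterbi decoding mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' tagger: the two-tag tagset, the English toy words, and all probabilities are made-up values, and the function name and the probability floor used for unseen events are assumptions.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-8):
    """Return the most probable tag sequence for a non-empty word list
    under a bigram HMM, working in log space."""
    def lp(p):
        return math.log(max(p, floor))  # floor avoids log(0) for unseen events

    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda item: item[1],
            )
            column[t] = (score + lp(emit_p[t].get(word, 0)), prev)
        best.append(column)

    # Backtrack from the best final tag to recover the full sequence.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy model (hypothetical probabilities, for illustration only).
tags = ["N", "V"]
start_p = {"N": 0.8, "V": 0.2}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"dogs": 0.5, "bark": 0.1}, "V": {"dogs": 0.05, "bark": 0.6}}
print(viterbi(["dogs", "bark"], tags, start_p, trans_p, emit_p))
```

In a real tagger the start, transition, and emission probabilities would be estimated from the annotated training portion of the corpus rather than written by hand.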

A Rule-Based Analysis from Raw Korean Text to Morphologically Annotated Corpora

  • Lee, Ki-Yong;Markus Schulze
    • Language and Information / v.6 no.2 / pp.105-128 / 2002
  • Morphologically annotated corpora are the basis for many tasks in computational linguistics. Most current approaches use statistically driven methods of morphological analysis that provide only POS tags. While this is sufficient for some applications, many others need a rule-based full morphological analysis that also yields lemmatization and segmentation. This work thus aims at (1) introducing a rule-based Korean morphological analyzer called Kormoran, based on the principle of linearity, which prohibits any combination of left-to-right or right-to-left analysis or backtracking, and (2) showing how it can be used as a POS tagger by adopting an ordinary preprocessing technique and by filtering out irrelevant morpho-syntactic information in the analyzed feature structures. It is shown that, besides providing a basis for subsequent syntactic or semantic processing, full morphological analyzers like Kormoran have greater power to resolve ambiguities than simple POS taggers. The focus of our present analysis is on Korean text.

Korean Head-Tail Tokenization and Part-of-Speech Tagging by using Deep Learning (딥러닝을 이용한 한국어 Head-Tail 토큰화 기법과 품사 태깅)

  • Kim, Jungmin;Kang, Seungshik;Kim, Hyeokman
    • IEMEK Journal of Embedded Systems and Applications / v.17 no.4 / pp.199-208 / 2022
  • Korean is an agglutinative language in which one or more morphemes combine to form a single word. Part-of-speech tagging separates each morpheme from a word and attaches a part-of-speech tag to it. In this study, we propose a new Korean part-of-speech tagging method based on the Head-Tail tokenization technique, which divides a word into a lexical morpheme part and a grammatical morpheme part without decomposing compound words. In this method, the Head and Tail are divided at the syllable boundary without restoring irregular conjugations or contracted syllables. A Korean part-of-speech tagger was implemented using Head-Tail tokenization and deep learning. Because fine-grained tags generate a large number of complex tags and lower the tagging accuracy, we reduced the tag set to complex tags composed of coarse-grained tags, which improved the tagging accuracy. The Head-Tail part-of-speech tagger was evaluated using BERT, syllable bigram, and subword bigram embeddings; both the syllable bigram and subword bigram embeddings improved performance compared to general BERT. Part-of-speech tagging was performed by integrating the Head-Tail tokenization model and the simplified part-of-speech tagging model, achieving 98.99% word-unit accuracy and 99.08% token-unit accuracy. The experiments also showed that part-of-speech tagging performance improved when the maximum token length was limited to twice the number of words.

Lattice-based Discriminative Approach for Korean Morphological Analysis (래티스상의 구조적 분류에 기반한 한국어 형태소 분석 및 품사 태깅)

  • Na, Seung-Hoon;Kim, Chang-Hyun;Kim, Young-Kil
    • Journal of KIISE:Software and Applications / v.41 no.7 / pp.523-532 / 2014
  • In this paper, we propose a lattice-based discriminative approach for Korean morphological analysis and POS tagging. In our approach, for an input sentence, a morpheme lattice is first created from a lexicon where each node corresponds to a morpheme in the lexicon and each edge is formed between two consecutive morphemes. A candidate result of morphological analysis is then represented as a path in the morpheme lattice which is defined as the sequence of edges, starting in the initial state and ending with the final state. In this setting, the morphological analysis is simply considered as the process of finding the best path among all possible paths. Experiment results show that the proposed lattice-based method outperforms the first-order linear-chain CRF.
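
The view of morphological analysis as a best-path search over a morpheme lattice can be sketched with a small dynamic program. This is only an illustration of the search step: the paper scores paths with a discriminative model, whereas the node and edge scores below are hand-set toy numbers, and the function name, the span representation, and the example lattice are all hypothetical.

```python
def best_path(length, nodes, edge_score):
    """nodes: list of (start, end, morpheme, tag, node_score) with end > start.
    Returns the highest-scoring node sequence covering positions [0, length)."""
    order = sorted(range(len(nodes)), key=lambda i: nodes[i][0])
    best = {}  # node index -> (score of best path ending here, previous node index)
    for i in order:
        start, _, _, _, score = nodes[i]
        if start == 0:
            best[i] = (score, None)
            continue
        # Extend every reachable node that ends exactly where this node starts.
        candidates = [
            (best[j][0] + edge_score(nodes[j], nodes[i]) + score, j)
            for j in best if nodes[j][1] == start
        ]
        if candidates:
            best[i] = max(candidates)

    finals = [(best[i][0], i) for i in best if nodes[i][1] == length]
    if not finals:
        return []
    _, i = max(finals)
    path = []
    while i is not None:  # follow back-pointers to the initial state
        path.append(nodes[i])
        i = best[i][1]
    return path[::-1]


# Toy lattice over the two syllables of "나는" (scores are made up).
nodes = [
    (0, 1, "나", "NP", 1.0), (0, 1, "날", "VV", 0.5),
    (1, 2, "는", "JX", 1.0), (1, 2, "는", "ETM", 0.8),
]
tag_pair_scores = {("NP", "JX"): 1.0, ("VV", "ETM"): 1.0}


def edge_score(prev_node, next_node):
    return tag_pair_scores.get((prev_node[3], next_node[3]), 0.0)


print([(m, t) for _, _, m, t, _ in best_path(2, nodes, edge_score)])
```

Because each edge score depends only on two consecutive morphemes, the search stays linear in the number of lattice edges.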