• Title/Summary/Keyword: Unigram

Search Result 26, Processing Time 0.018 seconds

A Comparative Study of Feature Extraction Methods for Authorship Attribution in the Text of Traditional East Asian Medicine with a Focus on Function Words (한의학 고문헌 텍스트에서의 저자 판별 - 기능어의 역할을 중심으로 -)

  • Oh, Junho
    • Journal of Korean Medical classics
    • /
    • v.33 no.2
    • /
    • pp.51-59
    • /
    • 2020
  • Objectives : We would like to study what is the most appropriate "feature" to effectively perform authorship attribution of the text of Traditional East Asian Medicine Methods : The authorship attribution performance of the Support Vector Machine (SVM) was compared by cross validation, depending on whether the function words or content words, single word or collocations, and IDF weights were applied or not, using 'Variorum of the Nanjing' as an experimental Corpus. Results : When using the combination of 'function words/uni-bigram/TF', the performance was best with accuracy of 0.732, and the combination of 'content words/unigram/TFIDF' showed the lowest accuracy of 0.351. Conclusions : This shows the following facts from the authorship attribution of the text of East Asian traditional medicine. First, function words play an important role in comparison to content words. Second, collocations was relatively important in content words, but single words have more important meanings in function words. Third, unlike general text analysis, IDF weighting resulted in worse performance.

Parallel Corpus Filtering and Korean-Optimized Subword Tokenization for Machine Translation (병렬 코퍼스 필터링과 한국어에 최적화된 서브 워드 분절 기법을 이용한 기계번역)

  • Park, Chanjun;kim, Gyeongmin;Lim, Heuiseok
    • Annual Conference on Human and Language Technology
    • /
    • 2019.10a
    • /
    • pp.221-224
    • /
    • 2019
  • 딥러닝을 이용한 Neural Machine Translation(NMT)의 등장으로 기계번역 분야에서 기존의 규칙 기반,통계기반 방식을 압도하는 좋은 성능을 보이고 있다. 본 논문은 기계번역 모델도 중요하지만 무엇보다 중요한 것은 고품질의 학습데이터를 구성하는 일과 전처리라고 판단하여 이에 관련된 다양한 실험을 진행하였다. 인공신경망 기계번역 시스템의 학습데이터 즉 병렬 코퍼스를 구축할 때 양질의 데이터를 확보하는 것이 무엇보다 중요하다. 그러나 양질의 데이터를 구하는 일은 저작권 확보의 문제, 병렬 말뭉치 구축의 어려움, 노이즈 등을 이유로 쉽지 않은 상황이다. 본 논문은 고품질의 학습데이터를 구축하기 위하여 병렬 코퍼스 필터링 기법을 제시한다. 병렬 코퍼스 필터링이란 정제와 다르게 학습 데이터에 부합하지 않다고 판단되며 소스, 타겟 쌍을 함께 삭제 시켜 버린다. 또한 기계번역에서 무엇보다 중요한 단계는 바로 Subword Tokenization 단계이다. 본 논문은 다양한 실험을 통하여 한-영 기계번역에서 가장 높은 성능을 보이는 Subword Tokenization 방법론을 제시한다. 오픈 된 한-영 병렬 말뭉치로 실험을 진행한 결과 병렬 코퍼스 필터링을 진행한 데이터로 만든 모델이 더 좋은 BLEU 점수를 보였으며 본 논문에서 제안하는 형태소 분석 단위 분리를 진행 후 Unigram이 반영된 SentencePiece 모델로 Subword Tokenization를 진행 하였을 시 가장 좋은 성능을 보였다.

  • PDF

A Study on Keyword Spotting System Using Pseudo N-gram Language Model (의사 N-gram 언어모델을 이용한 핵심어 검출 시스템에 관한 연구)

  • 이여송;김주곤;정현열
    • The Journal of the Acoustical Society of Korea
    • /
    • v.23 no.3
    • /
    • pp.242-247
    • /
    • 2004
  • Conventional keyword spotting systems use the connected word recognition network consisted by keyword models and filler models in keyword spotting. This is why the system can not construct the language models of word appearance effectively for detecting keywords in large vocabulary continuous speech recognition system with large text data. In this paper to solve this problem, we propose a keyword spotting system using pseudo N-gram language model for detecting key-words and investigate the performance of the system upon the changes of the frequencies of appearances of both keywords and filler models. As the results, when the Unigram probability of keywords and filler models were set to 0.2, 0.8, the experimental results showed that CA (Correctly Accept for In-Vocabulary) and CR (Correctly Reject for Out-Of-Vocabulary) were 91.1% and 91.7% respectively, which means that our proposed system can get 14% of improved average CA-CR performance than conventional methods in ERR (Error Reduction Rate).

N-gram based Language Model for the QWERTY Keyboard Input Errors in a Touch Screen Environment (터치스크린 환경에서 쿼티 자판 오타 교정을 위한 n-gram 언어 모델)

  • Ong, Yoon Gee;Kang, Seung Shik
    • Smart Media Journal
    • /
    • v.7 no.2
    • /
    • pp.54-59
    • /
    • 2018
  • With the increasing use of touch-enabled mobile devices such as smartphones and tablet PCs, the works are done on desktop computers and smartphones, and tablet PCs perform laptops. However, due to the nature of smart devices that require portability, QWERTY keyboard is densely arranged in a small screen. This is the cause of different typographical errors when using the mechanical QWERTY keyboard. Unlike the mechanical QWERTY keyboard, which has enough space for each button, QWERTY keyboard on the touch screen often has a small area assigned to each button, so that it is often the case that the surrounding buttons are input rather than the button the user intends to press. In this paper, we propose a method to automatically correct the input errors of the QWERTY keyboard in the touch screen environment by using the n-gram language model using the word unigram and the bigram probability.

Part-of-speech Tagging for Hindi Corpus in Poor Resource Scenario

  • Modi, Deepa;Nain, Neeta;Nehra, Maninder
    • Journal of Multimedia Information System
    • /
    • v.5 no.3
    • /
    • pp.147-154
    • /
    • 2018
  • Natural language processing (NLP) is an emerging research area in which we study how machines can be used to perceive and alter the text written in natural languages. We can perform different tasks on natural languages by analyzing them through various annotational tasks like parsing, chunking, part-of-speech tagging and lexical analysis etc. These annotational tasks depend on morphological structure of a particular natural language. The focus of this work is part-of-speech tagging (POS tagging) on Hindi language. Part-of-speech tagging also known as grammatical tagging is a process of assigning different grammatical categories to each word of a given text. These grammatical categories can be noun, verb, time, date, number etc. Hindi is the most widely used and official language of India. It is also among the top five most spoken languages of the world. For English and other languages, a diverse range of POS taggers are available, but these POS taggers can not be applied on the Hindi language as Hindi is one of the most morphologically rich language. Furthermore there is a significant difference between the morphological structures of these languages. Thus in this work, a POS tagger system is presented for the Hindi language. For Hindi POS tagging a hybrid approach is presented in this paper which combines "Probability-based and Rule-based" approaches. For known word tagging a Unigram model of probability class is used, whereas for tagging unknown words various lexical and contextual features are used. Various finite state machine automata are constructed for demonstrating different rules and then regular expressions are used to implement these rules. A tagset is also prepared for this task, which contains 29 standard part-of-speech tags. The tagset also includes two unique tags, i.e., date tag and time tag. These date and time tags support all possible formats. Regular expressions are used to implement all pattern based tags like time, date, number and special symbols. The aim of the presented approach is to increase the correctness of an automatic Hindi POS tagging while bounding the requirement of a large human-made corpus. This hybrid approach uses a probability-based model to increase automatic tagging and a rule-based model to bound the requirement of an already trained corpus. This approach is based on very small labeled training set (around 9,000 words) and yields 96.54% of best precision and 95.08% of average precision. The approach also yields best accuracy of 91.39% and an average accuracy of 88.15%.

Automatic Word Spacing of the Korean Sentences by Using End-to-End Deep Neural Network (종단 간 심층 신경망을 이용한 한국어 문장 자동 띄어쓰기)

  • Lee, Hyun Young;Kang, Seung Shik
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.8 no.11
    • /
    • pp.441-448
    • /
    • 2019
  • Previous researches on automatic spacing of Korean sentences has been researched to correct spacing errors by using n-gram based statistical techniques or morpheme analyzer to insert blanks in the word boundary. In this paper, we propose an end-to-end automatic word spacing by using deep neural network. Automatic word spacing problem could be defined as a tag classification problem in unit of syllable other than word. For contextual representation between syllables, Bi-LSTM encodes the dependency relationship between syllables into a fixed-length vector of continuous vector space using forward and backward LSTM cell. In order to conduct automatic word spacing of Korean sentences, after a fixed-length contextual vector by Bi-LSTM is classified into auto-spacing tag(B or I), the blank is inserted in the front of B tag. For tag classification method, we compose three types of classification neural networks. One is feedforward neural network, another is neural network language model and the other is linear-chain CRF. To compare our models, we measure the performance of automatic word spacing depending on the three of classification networks. linear-chain CRF of them used as classification neural network shows better performance than other models. We used KCC150 corpus as a training and testing data.