• Title/Summary/Keyword: Tokenization


Korean Part-Of-Speech Tagging by using Head-Tail Tokenization (Head-Tail 토큰화 기법을 이용한 한국어 품사 태깅)

  • Suh, Hyun-Jae; Kim, Jung-Min; Kang, Seung-Shik
    • Smart Media Journal / v.11 no.5 / pp.17-25 / 2022
  • Korean part-of-speech taggers decompose a compound morpheme into unit morphemes and attach a part-of-speech tag to each. As a result, morphemes are over-classified into overly detailed tags, and complex word forms are generated depending on the purpose of the tagger. When a part-of-speech tagger is used for keyword extraction in deep-learning-based language processing, decomposing compound particles and verb endings is unnecessary. In this study, the part-of-speech tagging problem is simplified by a Head-Tail tokenization technique that divides each word into only two tokens, a lexical-morpheme head and a grammatical-morpheme tail, thereby avoiding excessive morpheme decomposition. Part-of-speech tagging was then performed on the Head-Tail tokenized corpus with both a statistical technique and a deep learning model, and the accuracy of each was evaluated: the statistics-based TnT tagger and the deep-learning-based Bi-LSTM tagger were each trained on the corpus and their tagging accuracy measured. The Bi-LSTM tagger achieved a high accuracy of 99.52%, compared to 97.00% for the TnT tagger.
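The Head-Tail idea described above can be sketched in a few lines: each analyzed word is collapsed into exactly two tokens rather than fully decomposed morphemes. The tag prefixes below (N/V/M/X lexical, J/E grammatical) follow the Sejong-style convention and are an illustrative assumption, not the paper's exact tag set.

```python
# Sketch of Head-Tail tokenization: an analyzed eojeol (a list of
# (morpheme, tag) pairs) is split into a lexical "head" token and a
# grammatical "tail" token instead of being fully decomposed.
LEXICAL_PREFIXES = ("N", "V", "M", "X")    # nouns, verbs, modifiers, roots
GRAMMATICAL_PREFIXES = ("J", "E")          # particles, verb endings

def head_tail_split(analyzed_eojeol):
    """Merge morphemes into a (head, tail) pair of surface tokens."""
    head, tail = [], []
    for morph, tag in analyzed_eojeol:
        # Once a grammatical morpheme starts, everything after it is tail.
        if not tail and tag.startswith(LEXICAL_PREFIXES):
            head.append(morph)
        else:
            tail.append(morph)
    return "".join(head), "".join(tail)

# "먹었다" analyzed as 먹/VV + 었/EP + 다/EF -> head "먹", tail "었다"
print(head_tail_split([("먹", "VV"), ("었", "EP"), ("다", "EF")]))
```

A tagger trained on such pairs only has to label two tokens per word, which is what simplifies the tagging problem in the study.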

A Study of Analysis and Response and Plan for National and International Security Practices using Fin-Tech Technologies (핀테크 금융 기술을 이용한 국내외 보안 사례 분석 및 대응 방안에 대한 연구)

  • Shin, Seung-Soo; Jeong, Yoon-Su; An, Yu-Jin
    • Journal of Convergence Society for SMB / v.5 no.3 / pp.1-7 / 2015
  • Recently, Fin-Tech financial technology has emerged while financial security incidents at home and abroad have increased, and the security technologies currently operated by financial institutions have been reported to be vulnerable to attack. In this paper, we propose a response plan for security incidents in Fin-Tech services that employ diverse authentication methods and biometrics. The proposed method provides convenient banking services to users by integrating IT with financial technology, such as personal asset management and crowdfunding, and provides security by applying technologies such as PCI-DSS, tokenization, FDS, and the blockchain. The proposed method also analyzes a number of Fin-Tech security cases to inform the response.


Policy-based performance comparison study of Real-time Simultaneous Translation (실시간 동시통번역의 정책기반 성능 비교 연구)

  • Lee, Jungseob; Moon, Hyeonseok; Park, Chanjun; Seo, Jaehyung; Eo, Sugyeong; Lee, Seungjun; Koo, Seonmin; Lim, Heuiseok
    • Journal of the Korea Convergence Society / v.13 no.3 / pp.43-54 / 2022
  • Simultaneous translation decodes online, translating from only a partial sentence. The goal of simultaneous translation research is to improve translation quality under a delay constraint, so most studies examine the trade-off between quality and delay. We conducted experiments on fixed-policy simultaneous translation for Korean. Our experiments suggest that Korean tokenization produces many fragments, resulting in greater delay than in other languages. We suggest follow-up studies, such as n-gram tokenization, to address this problem.
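The fixed policies studied in work like this are typically of the "wait-k" family: read k source tokens first, then alternate one write per read. A minimal schedule sketch, with the final WRITE standing in for flushing the remaining target output (the actual incremental decoder is abstracted away):

```python
# Minimal sketch of a fixed "wait-k" simultaneous-translation policy:
# READ k source tokens, then alternate WRITE/READ until the source ends.
def wait_k_schedule(num_source_tokens, k):
    """Return the READ/WRITE action sequence for a wait-k policy."""
    actions = []
    read = 0
    while read < min(k, num_source_tokens):
        actions.append("READ")
        read += 1
    while read < num_source_tokens:
        actions.append("WRITE")
        actions.append("READ")
        read += 1
    actions.append("WRITE")  # flush the remaining target after source ends
    return actions

print(wait_k_schedule(5, k=3))
```

Under such a policy, a tokenizer that fragments Korean into many small pieces inflates `num_source_tokens`, which directly lengthens the schedule and hence the delay, matching the paper's observation.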

Design of MD Authentication and Privacy for Mobile Micro-payment based on NFC (NFC 기반 모바일 소액 결제를 위한 MD 인증과 프라이버시 설계)

  • Kim, Yong-Il; Kim, Dae-Gue; Cha, Byung-Rae
    • Journal of Advanced Navigation Technology / v.17 no.1 / pp.47-55 / 2013
  • In this paper, we propose an NFC-based micro-payment model together with authentication and privacy techniques to support micro-payments and help reinvigorate traditional markets. The model supports payment with NFC-enabled smartphones, while encryption and tokenization provide MD authentication, indirect authentication, and privacy for the user's payment.
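Payment tokenization of the kind mentioned here replaces the real card number (PAN) with a random surrogate that only a token vault can map back. A toy sketch under stated assumptions (the in-memory dict vault is purely illustrative; real systems use PCI-DSS-compliant storage):

```python
import secrets

# Sketch of payment-card tokenization: the real PAN is swapped for a
# random surrogate token with no mathematical relation to the PAN, so
# intercepting the token reveals nothing about the card number.
class TokenVault:
    def __init__(self):
        self._vault = {}

    def tokenize(self, pan: str) -> str:
        token = secrets.token_hex(8)  # 16 hex chars, random surrogate
        self._vault[token] = pan
        return token

    def detokenize(self, token: str) -> str:
        return self._vault[token]

vault = TokenVault()
tok = vault.tokenize("4111111111111111")
print(vault.detokenize(tok))  # only the vault can recover the PAN
```

The design point is that the merchant-side payment flow handles only tokens; the PAN exists solely inside the vault boundary.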

A Methodology for Urdu Word Segmentation using Ligature and Word Probabilities

  • Khan, Yunus; Nagar, Chetan; Kaushal, Devendra S.
    • International Journal of Ocean System Engineering / v.2 no.1 / pp.24-31 / 2012
  • This paper introduces a word segmentation technique for handwritten recognition of Urdu script. Word segmentation, or word tokenization, is a primary step in understanding sentences written in Urdu. Several techniques are available for word segmentation in other languages, but little work has been done for Urdu Optical Character Recognition (OCR) systems. The proposed method finds word boundaries in a sequence of ligatures using probabilistic formulas that exploit knowledge of ligature and word collocations in the corpus. The word identification rate of this technique is 97.10%, with a 66.63% identification rate for unknown words.
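The boundary search described above can be sketched as dynamic programming over the ligature sequence, choosing the segmentation that maximizes the product of word probabilities. The tiny probability table below is illustrative, not taken from the paper's corpus:

```python
import math

# Sketch of probability-driven segmentation: pick the boundaries over a
# ligature sequence that maximize the product of word probabilities.
WORD_PROB = {"ab": 0.4, "c": 0.2, "abc": 0.1, "a": 0.05, "bc": 0.05}

def segment(ligatures, max_len=3):
    """Return the highest-probability segmentation of a ligature list."""
    n = len(ligatures)
    best = [(-math.inf, [])] * (n + 1)   # (log-prob, words) per prefix
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = "".join(ligatures[j:i])
            p = WORD_PROB.get(word)
            if p is None:
                continue
            score = best[j][0] + math.log(p)
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [word])
    return best[n][1]

print(segment(["a", "b", "c"]))
```

The paper additionally conditions on ligature collocations; the sketch keeps only the word-probability term to show the shape of the search.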

KoRIBES : A Study on the Problems of RIBES in Automatic Evaluation English-Korean Patent Machine Translation (특허 기계 번역에 대한 RIBES 한국어 자동평가 문제에 대한 고찰)

  • Jang, Hyeon-Jin; Jang, Moon-Seok; Noh, Han-Sung
    • Annual Conference on Human and Language Technology / 2020.10a / pp.543-547 / 2020
  • Machine translation is among the most widely used and fastest-developing areas of natural language processing. Human evaluation of machine translation is the most accurate and important, but it takes considerable time and cost, so many automatic evaluation methods have been proposed and adopted; however, automatic metrics that properly reflect the characteristics of Korean have not been studied. Widely used metrics such as BLEU often fail to produce the desired evaluation because of differences between languages, and this occurs especially often for technical documents such as patents and papers. In this paper, using RIBES, in which word precision and word order affect the score, we propose an evaluation method that separates compound morphemes during tokenization so that automatic evaluation of English-to-Korean patent machine translation agrees more closely with human judgments.
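RIBES is sensitive to word order because its core is a rank correlation (Kendall's tau) over the positions at which reference words appear in the hypothesis. A sketch of just that ordering component (real RIBES also applies unigram-precision penalties, which are omitted here):

```python
# Sketch of the word-order component behind RIBES: normalized Kendall's
# tau over a rank sequence (1.0 = reference order fully preserved).
def kendall_tau(ranks):
    """Fraction of concordant pairs in a rank sequence."""
    n = len(ranks)
    if n < 2:
        return 1.0
    concordant = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if ranks[j] > ranks[i]
    )
    return concordant / (n * (n - 1) / 2)

# Hypothesis words aligned at reference positions 0, 2, 1, 3:
print(kendall_tau([0, 2, 1, 3]))  # 5/6 ≈ 0.833
```

Because the metric matches words before ranking them, how compound morphemes are tokenized changes which words align at all, which is why the paper focuses on the tokenization step.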


NFT Tokenization of Real Estate and Divisible FT Trading with Asset Portfolio Management (부동산 소유권 NFT 와 분할 판매 및 거래 시스템 설계)

  • Kim, Young-Gun; Kim, Seong-Whan; Song, Hyo Jung
    • Proceedings of the Korea Information Processing Society Conference / 2022.11a / pp.258-260 / 2022
  • A non-fungible token (NFT) is unique and cannot be further divided. An NFT proves ownership of digital content, but currently 1) its utility beyond proof of ownership is unclear, 2) although it is a token, it has almost no liquidity, and 3) its price is unpredictable. Real estate in particular has very high prices, so the barrier to investment entry is very high. Fractionalizing an NFT can be expected to increase liquidity and, through improved accessibility, grow community volume. Using these properties, real estate that was previously difficult to invest in can be invested in easily through various technologies. In addition, we designed and implemented an algorithm that uses the Black-Litterman model to construct an optimal portfolio over multiple kinds of NFTs.

Comparison of Word Extraction Methods Based on Unsupervised Learning for Analyzing East Asian Traditional Medicine Texts (한의학 고문헌 텍스트 분석을 위한 비지도학습 기반 단어 추출 방법 비교)

  • Oh, Junho
    • Journal of Korean Medical classics / v.32 no.3 / pp.47-57 / 2019
  • Objectives : We aim to assist in choosing an appropriate method for word extraction when analyzing East Asian Traditional Medical texts based on unsupervised learning. Methods : In order to assign ranks to substrings, we conducted a test using one method (BE: branching entropy) for the exterior boundary value, three methods (CS: cohesion score, TS: t-score, SL: simple-ll) for the interior boundary value, and six methods from combining them (BE×SL, BE×TS, BE×CS, CS×TS, CS×SL, TS×SL). Results : When Miss Rate (MR) was used as the criterion, the error was minimal when TS and SL were used together, and maximal when CS was used alone. When the number of segmented texts was applied as a weight, the results were best with SL and worst with BE alone. Conclusions : Unsupervised-learning-based word extraction can be used to analyze texts without a prepared vocabulary. When using this method, SL, or the combination of SL and TS, should be considered first.
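Of the interior-boundary measures compared above, the cohesion score (CS) is the simplest to state: it measures how strongly the characters of a substring stick together, estimated purely from substring counts. A sketch under a common formulation (the counts dictionary is illustrative, not from the paper's corpus):

```python
# Sketch of the cohesion score (CS) for unsupervised word extraction:
# CS(w) = (count(w) / count(w[0])) ** (1 / (len(w) - 1)), i.e. the
# geometric mean of the per-character "stickiness" of the substring.
def cohesion_score(substring, counts):
    """Cohesion of a substring given raw occurrence counts."""
    n = len(substring)
    if n < 2:
        return 0.0
    first = counts.get(substring[0], 0)
    whole = counts.get(substring, 0)
    if first == 0 or whole == 0:
        return 0.0
    return (whole / first) ** (1 / (n - 1))

counts = {"黄": 100, "黄芩": 80, "黄芩湯": 60}
print(cohesion_score("黄芩", counts))    # 80/100 = 0.8
print(cohesion_score("黄芩湯", counts))  # (60/100) ** (1/2) ≈ 0.775
```

Branching entropy, by contrast, looks outward at the diversity of characters adjacent to the substring, which is why the paper classifies it as an exterior-boundary measure.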

Phrase-Chunk Level Hierarchical Attention Networks for Arabic Sentiment Analysis

  • Abdelmawgoud M. Meabed; Sherif Mahdy Abdou; Mervat Hassan Gheith
    • International Journal of Computer Science & Network Security / v.23 no.9 / pp.120-128 / 2023
  • In this work, we present ATSA, a hierarchical attention deep learning model for Arabic sentiment analysis. ATSA addresses several challenges and limitations that arise when classical models are applied to opinion mining in Arabic. Arabic-specific challenges, including morphological complexity and language sparsity, are addressed by modeling semantic composition at the level of Arabic morphological analysis after tokenization. ATSA performs phrase-chunk sentiment embedding to provide a broader set of features covering syntactic, semantic, and sentiment information. We used a phrase structure parser to generate syntactic parse trees that serve as a reference for ATSA. This allows semantic and sentiment composition to follow the natural order in which words and phrase chunks combine in a sentence. The proposed model was evaluated on three Arabic corpora covering different genres (newswire, online comments, and tweets) and different writing styles (MSA and dialectal Arabic). Experiments showed that each of the proposed contributions achieved a significant improvement, and the combination of all contributions, which makes up the complete ATSA model, improved classification accuracy by 3% and 2% on the Tweets and Hotel reviews datasets, respectively, compared to existing models.

Development and Evaluation of Information Extraction Module for Postal Address Information (우편주소정보 추출모듈 개발 및 평가)

  • Shin, Hyunkyung; Kim, Hyunseok
    • Journal of Creative Information Culture / v.5 no.2 / pp.145-156 / 2019
  • In this study, we developed and evaluated an information extraction module based on named entity recognition. The module was designed to extract postal address information from arbitrary documents without prior knowledge of the document layout. From the perspective of information extraction practice, our approach is a probabilistic n-gram (bi- or tri-gram) method, a generalization of uni-gram keyword matching. The main difference between our approach and conventional natural language processing methods is that sentence detection, tokenization, and POS tagging are applied recursively rather than sequentially. Test results on approximately two thousand documents are presented in this paper.
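The bi-gram generalization of keyword matching mentioned above can be sketched simply: instead of checking single keywords, score each adjacent token pair by how likely it is to occur inside an address. The probability table, token names, and threshold below are all illustrative assumptions:

```python
# Sketch of bi-gram scoring for address extraction: adjacent token pairs
# are scored against a table of address-context probabilities, so single
# common words do not fire but address-like pairs do.
BIGRAM_ADDR_PROB = {
    ("Seoul", "Gangnam-gu"): 0.9,
    ("Gangnam-gu", "Teheran-ro"): 0.8,
    ("Dear", "Sir"): 0.0,
}

def address_spans(tokens, threshold=0.5):
    """Return token pairs whose address probability exceeds a threshold."""
    spans = []
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        if BIGRAM_ADDR_PROB.get(pair, 0.0) >= threshold:
            spans.append(pair)
    return spans

print(address_spans(["Dear", "Sir", "Seoul", "Gangnam-gu", "Teheran-ro"]))
```

A uni-gram matcher would have to decide on "Seoul" alone; the pair context is what lets the module work without knowing the document layout.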