Title/Summary/Keyword: Tagger


Robust Part-of-Speech Tagger using Statistical and Rule-based Approach (통계와 규칙을 이용한 강인한 품사 태거)

  • Shim, Jun-Hyuk;Kim, Jun-Seok;Cha, Jong-Won;Lee, Geun-Bae
    • Annual Conference on Human and Language Technology / 1999.10d / pp.60-75 / 1999
  • Part-of-speech tagging is the most fundamental stage of natural language processing: it serves as preprocessing for higher-level tasks such as syntactic and semantic analysis, and is used by itself in applications such as language information extraction and information retrieval. POS tagging research divides broadly into statistical approaches, rule-based approaches, and hybrid approaches that combine the two. POSTAG, the POS tagging system of the POSTECH natural language processing engine (SKOPE), is a hybrid tagging system with strengthened estimation of unknown words. The system consists of a morphological analyzer, a statistical POS tagger, and a rule-based post-processor for error correction. These components are not simply chained in series: based on a morpheme connectivity table, they build and process a morpheme connectivity graph during analysis and thus interact closely. In addition, a pattern dictionary for unknown words allows unknown words to be handled in the same way as registered words, yielding efficient and robust tagging. Furthermore, a bidirectional mapping between the POSTAG tagset and the standard tagset of the Electronics and Telecommunications Research Institute (ETRI) makes it possible to obtain large-scale training data for POSTAG from corpora tagged with the standard tagset, and POSTAG can output tagging results in either tagset. On the 30,000 eojeols (word phrases) provided at MATEC '99, the system achieved 95% morpheme-level accuracy when outputting in the standard tagset, and 97% accuracy for POSTAG's own tagging without the tagset mapping.

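The hybrid design sketched in the abstract, a statistical tagger whose output is corrected by rule-based post-processing, can be illustrated in miniature. The toy frequencies, tags, and correction rule below are hypothetical stand-ins; POSTAG's actual morpheme connectivity graph and error-correction rules are far richer.

```python
# Minimal sketch of a hybrid tagger: pick the statistically most likely tag
# per morpheme, then let hand-written rules correct systematic errors.
# Frequencies, tags, and the rule below are toy values, not POSTAG's model.

tag_freq = {
    "학교": {"NNG": 100},          # common noun
    "가": {"JKS": 90, "EC": 10},   # subject particle vs. verbal ending
}

def statistical_tag(morphemes):
    tags = []
    for m in morphemes:
        dist = tag_freq.get(m, {"UNK": 1})   # unknown-word fallback
        tags.append(max(dist, key=dist.get))
    return tags

# Rule-based post-processor: (previous tag, morpheme, mistag, correction).
RULES = [
    ("VV", "가", "JKS", "EC"),   # after a verb stem, '가' is an ending
]

def correct(morphemes, tags):
    for i in range(1, len(tags)):
        for prev, m, wrong, right in RULES:
            if tags[i - 1] == prev and morphemes[i] == m and tags[i] == wrong:
                tags[i] = right
    return tags

morphemes = ["학교", "가"]
print(correct(morphemes, statistical_tag(morphemes)))  # ['NNG', 'JKS']
```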

Korean Noun Extractor using Occurrence Patterns of Nouns and Post-noun Morpheme Sequences (한국어 명사 출현 특성과 후절어를 이용한 명사추출기)

  • Park, Yong-Hyun;Hwang, Jae-Won;Ko, Young-Joong
    • Journal of KIISE:Software and Applications / v.37 no.12 / pp.919-927 / 2010
  • Since the performance of mobile devices has recently improved, the demand for information retrieval on mobile devices as well as PCs has increased. If a mobile device with small memory uses a traditional language analysis tool to extract nouns from Korean texts, the analysis imposes a heavy burden. As a result, the need for language analysis tools suited to mobile devices is growing. This paper therefore proposes a new method for noun extraction that uses post-noun morpheme sequences and noun occurrence patterns learned from a large corpus. The proposed noun extractor requires a dictionary of only 146 KB, about 4% of the dictionary size of an existing noun extractor built on a POS tagger, and achieves an $F_1$-measure of 0.86. In addition, it extracts unknown-word nouns easily because its dependence on noun dictionaries is low.
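
The particle-stripping idea can be sketched in a few lines: an eojeol ending in a known post-noun morpheme sequence yields the preceding string as a noun candidate. The particle list and example sentence below are illustrative only; the paper additionally learns noun occurrence patterns from a large corpus.

```python
# Minimal sketch of particle-stripping noun extraction: if an eojeol
# (space-delimited Korean word phrase) ends with a known post-noun morpheme
# sequence (josa), the remainder is taken as a noun candidate.
# The particle list and sample text are illustrative, not the paper's data.

POST_NOUN = ["은", "는", "이", "가", "을", "를", "에서", "으로", "의", "에"]
POST_NOUN.sort(key=len, reverse=True)   # longest first, so "에서" beats "에"

def noun_candidates(sentence):
    nouns = []
    for eojeol in sentence.split():
        for josa in POST_NOUN:
            if eojeol.endswith(josa) and len(eojeol) > len(josa):
                nouns.append(eojeol[: -len(josa)])
                break
    return nouns

print(noun_candidates("학생이 도서관에서 책을 읽는다"))
# ['학생', '도서관', '책']
```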

Categorization of Korean News Articles Based on Convolutional Neural Network Using Doc2Vec and Word2Vec (Doc2Vec과 Word2Vec을 활용한 Convolutional Neural Network 기반 한국어 신문 기사 분류)

  • Kim, Dowoo;Koo, Myoung-Wan
    • Journal of KIISE / v.44 no.7 / pp.742-747 / 2017
  • In this paper, we propose a novel approach that improves on a Convolutional Neural Network (CNN) document classifier built on word2vec embeddings by additionally exploiting doc2vec-style document vectors. The Word Piece Model (WPM) is shown empirically to outperform other tokenization methods such as phrase units and a part-of-speech tagger (classification rate: 79.5%). We then classified ten categories of Korean news articles by feeding word vectors and document vectors, generated after applying WPM, to the baseline and proposed models. The proposed model achieved a higher classification rate (89.88%) than the baseline (86.89%), a 22.80% reduction in error rate. Throughout this research, it is demonstrated that applying doc2vec to the document classification task yields more effective results because doc2vec generates similar document vector representations for documents belonging to the same category.
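
A minimal sketch of the doc2vec half of this pipeline, using gensim, with a logistic-regression classifier standing in for the paper's CNN; the four toy "articles" and their category labels are invented for illustration.

```python
# Train doc2vec on a tiny labeled corpus, infer document vectors, and fit a
# simple classifier on them. The corpus and labels are made-up examples;
# a linear model stands in here for the paper's CNN.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Pretend these are WPM-tokenized news articles with category labels.
corpus = [
    (["stock", "market", "rises"], "economy"),
    (["team", "wins", "final"], "sports"),
    (["bank", "rates", "fall"], "economy"),
    (["player", "scores", "goal"], "sports"),
]

docs = [TaggedDocument(words=w, tags=[i]) for i, (w, _) in enumerate(corpus)]
d2v = Doc2Vec(docs, vector_size=32, window=2, min_count=1, epochs=50)

X = [d2v.infer_vector(w) for w, _ in corpus]
y = [label for _, label in corpus]

clf = LogisticRegression(max_iter=1000).fit(X, y)  # stand-in for the CNN
print(clf.predict([d2v.infer_vector(["market", "rates"])]))  # likely 'economy'
```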

Development of a Data Reduction Algorithm for Optical Wide Field Patrol (OWL) II: Improving Measurement of Lengths of Detected Streaks

  • Park, Sun-Youp;Choi, Jin;Roh, Dong-Goo;Park, Maru;Jo, Jung Hyun;Yim, Hong-Suh;Park, Young-Sik;Bae, Young-Ho;Park, Jang-Hyun;Moon, Hong-Kyu;Choi, Young-Jun;Cho, Sungki;Choi, Eun-Jung
    • Journal of Astronomy and Space Sciences / v.33 no.3 / pp.221-227 / 2016
  • As described in the previous paper (Park et al. 2013), the detector subsystem of Optical Wide-field Patrol (OWL) provides many observational data points for a single artificial satellite or piece of space debris in the form of small streaks, using a chopper system and a time tagger. The position and the corresponding time data are matched under the assumption that the length of a streak on the CCD frame is proportional to the exposure duration during which the chopper blades do not obscure the CCD window. In the previous study, however, the length was measured using the diagonal of the rectangular image area containing the streak; the results were quite ambiguous and inaccurate, allowing possible matching errors between position and time data. Furthermore, because only one (position, time) data point is created from each streak, the efficiency of the observation decreases. To define the length of a streak correctly, it is important to locate the endpoints of the streak. In this paper, a method using a differential convolution mask pattern is tested. This method obtains the positions where the pixel values change sharply. These endpoints can be regarded as directly detected positional data, and the number of data points is doubled as a result.
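
A one-dimensional analogue of this endpoint detection can be sketched with NumPy: convolving an intensity profile with a differential mask makes the two ends of a streak stand out as the extrema of the response. The synthetic profile below is illustrative only; the paper works on 2-D CCD frames.

```python
# Minimal sketch of endpoint detection with a differential convolution mask:
# the mask approximates a discrete derivative, so sharp intensity changes
# (the streak endpoints) appear as the maximum and minimum of the response.

import numpy as np

np.random.seed(0)
profile = np.zeros(40)
profile[12:29] = 100.0                              # streak on pixels 12..28
profile += np.random.normal(0, 1.0, profile.size)   # background noise

mask = np.array([1.0, 0.0, -1.0])        # differential convolution mask
response = np.convolve(profile, mask, mode="same")  # ~ profile[i+1]-profile[i-1]

rising = int(np.argmax(response))    # sharp increase: start of streak
falling = int(np.argmin(response))   # sharp decrease: end of streak
print(f"streak endpoints near pixels {rising} and {falling}")
# e.g. "streak endpoints near pixels 12 and 28"
```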

Functional Expansion of Morphological Analyzer Based on Longest Phrase Matching For Efficient Korean Parsing (효율적인 한국어 파싱을 위한 최장일치 기반의 형태소 분석기 기능 확장)

  • Lee, Hyeon-yoeng;Lee, Jong-seok;Kang, Byeong-do;Yang, Seung-weon
    • Journal of Digital Contents Society / v.17 no.3 / pp.203-210 / 2016
  • Korean freely omits sentence elements and has flexible modifier scope, so it is better to handle these phenomena in the morphological analyzer than in the parser. In this paper, we propose functional extensions of the morphological analyzer that ease the burden of parsing, based on a longest-phrase matching method. When a series of several morphemes forms a single syntactic category, as produced by the processing of unknown words, compound verbs, compound nouns, numbers, and symbols, our method combines them into one syntactic unit and assigns it semantic features as a unit. The proposed morphological analysis method removes unnecessary morphological ambiguities and reduces the number of morphological analysis results, thereby improving the accuracy of the tagger and parser. Empirically, we found that our method decreases the number of parse trees by 73.4% and parsing time by 52.4% on average.
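
A minimal sketch of longest-phrase matching over a morpheme sequence: the longest tag pattern known to form one syntactic unit is merged first. The patterns, tags, and example below are toys, not the paper's actual rules.

```python
# Greedily merge the longest run of morphemes whose tag sequence matches a
# known pattern (compound noun, number + classifier, ...) into one unit.
# PATTERNS and the sample input are illustrative only.

PATTERNS = {
    ("NNG", "NNG", "NNG"): "NNG",   # three-noun compound
    ("NNG", "NNG"): "NNG",          # two-noun compound
    ("SN", "NNB"): "NR",            # number + bound noun (classifier)
}
MAX_LEN = max(len(p) for p in PATTERNS)

def merge_longest(morphs):
    """morphs: list of (surface, tag); returns merged (surface, tag) list."""
    out, i = [], 0
    while i < len(morphs):
        for n in range(min(MAX_LEN, len(morphs) - i), 1, -1):  # longest first
            window = tuple(tag for _, tag in morphs[i:i + n])
            if window in PATTERNS:
                surface = "".join(s for s, _ in morphs[i:i + n])
                out.append((surface, PATTERNS[window]))
                i += n
                break
        else:
            out.append(morphs[i])
            i += 1
    return out

print(merge_longest([("정보", "NNG"), ("검색", "NNG"),
                     ("시스템", "NNG"), ("이", "JKS")]))
# [('정보검색시스템', 'NNG'), ('이', 'JKS')]
```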

Effectiveness of endodontic retreatment using WaveOne Primary files in reciprocating and rotary motions

  • Patricia Marton Costa;Renata Maira de Souza Leal;Guilherme Hiroshi Yamanari;Bruno Cavalini Cavenago;Marco Antonio Hungaro Duarte
    • Restorative Dentistry and Endodontics / v.48 no.2 / pp.15.1-15.7 / 2023
  • Objectives: This study evaluated the efficiency of WaveOne Primary files (Dentsply Sirona) for removing root canal fillings with 2 types of movement: reciprocating (RCP) and continuous counterclockwise rotation (CCR). Materials and Methods: Twenty mandibular incisors were prepared with an RCP instrument (25.08) and filled using the Tagger hybrid obturation technique. The teeth were retreated with a WaveOne Primary file and randomly allocated to 2 experimental retreatment groups (n = 10) according to movement type: RCP and CCR. The root canals were emptied of filling material in the first 3 steps of insertion, until the working length was reached. The timing of retreatment and procedural errors were recorded for all samples. The specimens were scanned before and after the retreatment procedure with micro-computed tomography to calculate the percentage and volume (mm3) of residual filling material. The results were statistically evaluated using paired and independent t-tests, with a significance level of 5%. Results: No significant difference was found in the timing of filling removal between the groups, with means of 322 seconds (RCP) and 327 seconds (CCR) (p > 0.05). There were 6 instrument fractures: 1 in an RCP motion file and 5 in continuous rotation files. The volumes of residual filling material were similar (9.94% for RCP and 15.94% for CCR; p > 0.05). Conclusions: The WaveOne Primary files used in retreatment performed similarly in both RCP and CCR movements. Neither movement type completely removed the obturation material, but the RCP movement provided greater safety.

HMM Based Part of Speech Tagging for Hadith Isnad

  • Abdelkarim Abdelkader
    • International Journal of Computer Science & Network Security / v.23 no.3 / pp.151-160 / 2023
  • The Hadith is the second source of Islamic jurisprudence after the Qur'an, and both sources are indispensable for Muslims to practice Islam. All Ahadith have been collected and written down, but most books of Hadith contain Ahadith that may be weak or rejected. For a long time, therefore, scholars of Hadith have defined laws, rules, and principles for distinguishing correct Hadith (Sahih) from fair (Hassen) and weak (Dhaif) ones. Until now, however, these rules, laws, and principles have been applied manually by specialists or students. The work presented in this paper is part of the automatic processing of Hadith; more specifically, it aims to automatically process the chain of narrators (Hadith Isnad), find its different components, and assign each component its own tag using a statistical method: the Hidden Markov Model (HMM). HMMs are a powerful abstraction for time-series data and a robust tool for representing probability distributions over sequences of observations. In this paper, we describe an important tool for Hadith Isnad processing: an HMM-based chunker. The role of this tool is to decompose the chain of narrators (Isnad) and determine the tag of each part of the Isnad (POI). First, we compiled a tagset containing 13 tags. Then we used these tags to manually construct a corpus of 100 chains of narrators from "Sahih Alboukhari" and extracted a lexicon from this corpus. The lexicon is a set of XML documents based on HPSG features and contains information on 134 narrators. We then designed and implemented an HMM-based analyzer that assigns each part of an Isnad its proper tag and each narrator its features. The system was tested on 2,661 non-duplicated Isnads from "Sahih Alboukhari" and achieved an F-score of 93%.
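
The HMM chunking step can be illustrated with a minimal Viterbi decoder. The three-tag model, transliterated words, and all probabilities below are invented toys; the paper's model uses 13 tags and an HPSG-feature lexicon.

```python
# Minimal sketch of HMM tagging with the Viterbi algorithm, as used for
# labeling parts of an Isnad (POI). All probabilities are toy values.

import math

TAGS = ["TRANSMIT", "NARRATOR", "LINK"]        # hypothetical POI tags

start = {"TRANSMIT": 0.8, "NARRATOR": 0.1, "LINK": 0.1}
trans = {"TRANSMIT": {"TRANSMIT": 0.1, "NARRATOR": 0.8, "LINK": 0.1},
         "NARRATOR": {"TRANSMIT": 0.1, "NARRATOR": 0.2, "LINK": 0.7},
         "LINK":     {"TRANSMIT": 0.6, "NARRATOR": 0.3, "LINK": 0.1}}
emit  = {"TRANSMIT": {"haddathana": 0.9, "an": 0.1},
         "NARRATOR": {"malik": 0.5, "nafi": 0.5},
         "LINK":     {"an": 1.0}}

def viterbi(words):
    # V[t][tag] = (log prob of best path ending in tag, backpointer)
    V = [{t: (math.log(start[t]) + math.log(emit[t].get(words[0], 1e-9)), None)
          for t in TAGS}]
    for w in words[1:]:
        V.append({t: max(((V[-1][p][0] + math.log(trans[p][t])
                           + math.log(emit[t].get(w, 1e-9)), p) for p in TAGS))
                  for t in TAGS})
    best = max(TAGS, key=lambda t: V[-1][t][0])   # best final state
    path = [best]
    for row in reversed(V[1:]):                   # follow backpointers
        path.append(row[path[-1]][1])
    return list(reversed(path))

print(viterbi(["haddathana", "malik", "an", "nafi"]))
# ['TRANSMIT', 'NARRATOR', 'LINK', 'NARRATOR']
```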

Postoperative pain after endodontic treatment of necrotic teeth with large intentional foraminal enlargement

  • Ricardo Machado;Daniel Comparin;Sergio Aparecido Ignacio;Ulisses Xavier da Silva Neto
    • Restorative Dentistry and Endodontics / v.46 no.3 / pp.31.1-31.13 / 2021
  • Objectives: To evaluate postoperative pain after endodontic treatment of necrotic teeth using large intentional foraminal enlargement (LIFE). Materials and Methods: The sample included 60 asymptomatic necrotic teeth (with or without chronic apical periodontitis) with a periodontal probing depth of 3 mm, previously accessed and referred for endodontic treatment. After the preliminary procedures, the position and approximate size of the apical foramen (AF) were determined using an apex locator and K flexo-files, respectively. Chemomechanical preparation was performed with ProFile 04 files 2 mm beyond the AF to achieve the LIFE, using 2.5 mL of 2.5% NaOCl at each file change. The filling was performed with Tagger's hybrid technique and EndoFill sealer. All patients were phoned at 24, 48, and 72 hours after treatment to classify postoperative pain. Statistical analysis was performed with different tests at a significance level of 5%. Results: Age, gender, periradicular status, and tooth type did not influence postoperative pain (p > 0.05). Only 1 patient (1.66%) reported severe pain after 72 hours. Moderate pain was reported by 7, 4, and 3 patients after 24, 48, and 72 hours, respectively (p = 0.0001). However, paired analyses showed a statistically significant difference only between 24 and 72 hours (p = 0.04). Sealer extrusion did not influence postoperative pain (p > 0.05). Conclusions: Severe or moderate postoperative pain was uncommon after endodontic treatment of necrotic teeth with LIFE.

Part-of-speech Tagging for Hindi Corpus in Poor Resource Scenario

  • Modi, Deepa;Nain, Neeta;Nehra, Maninder
    • Journal of Multimedia Information System / v.5 no.3 / pp.147-154 / 2018
  • Natural language processing (NLP) is an emerging research area that studies how machines can be used to perceive and manipulate text written in natural languages. Various tasks can be performed on natural language by analyzing it through annotation tasks such as parsing, chunking, part-of-speech tagging, and lexical analysis, and these tasks depend on the morphological structure of the particular language. The focus of this work is part-of-speech (POS) tagging for Hindi. POS tagging, also known as grammatical tagging, is the process of assigning a grammatical category, such as noun, verb, time, date, or number, to each word of a given text. Hindi is the most widely used and official language of India, and is among the top five most spoken languages of the world. A diverse range of POS taggers is available for English and other languages, but they cannot be applied directly to Hindi, because Hindi is one of the most morphologically rich languages and differs significantly from them in morphological structure. Thus, this work presents a POS tagger for Hindi using a hybrid approach that combines probability-based and rule-based methods. For tagging known words, a unigram probability model is used; for tagging unknown words, various lexical and contextual features are used. Finite-state automata are constructed to capture the rules, which are then implemented as regular expressions. A tagset of 29 standard part-of-speech tags was also prepared for this task, including two unique tags, a date tag and a time tag, that support all common formats; regular expressions implement all pattern-based tags such as time, date, number, and special symbols. The aim of the presented approach is to increase the correctness of automatic Hindi POS tagging while bounding the need for a large hand-made corpus: the probability-based model improves automatic tagging, while the rule-based model limits the dependence on a pre-tagged training corpus. Trained on a very small labeled set (around 9,000 words), the approach yields a best precision of 96.54% and an average precision of 95.08%, with a best accuracy of 91.39% and an average accuracy of 88.15%.
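
A miniature version of this hybrid scheme, a unigram lexicon backed by regex rules for pattern-based and unknown tokens, assuming a toy lexicon and tag names rather than the paper's 29-tag tagset.

```python
# Known words get their most frequent tag from a unigram lexicon; unknown
# words fall through to regex rules for dates, times, and numbers.
# The lexicon entries, tags, and patterns are illustrative only.

import re

# Unigram lexicon: word -> {tag: count}, as estimated from a tagged corpus.
LEXICON = {
    "राम": {"NN": 12, "NNP": 30},
    "खाता": {"VB": 20},
}

# Rule component: regex patterns tried in order for out-of-lexicon words.
RULES = [
    (re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"), "DATE"),   # e.g. 15/08/2018
    (re.compile(r"^\d{1,2}:\d{2}$"), "TIME"),             # e.g. 10:30
    (re.compile(r"^\d+$"), "NUM"),
]

def tag(word):
    if word in LEXICON:
        dist = LEXICON[word]
        return max(dist, key=dist.get)        # most frequent tag wins
    for pattern, t in RULES:
        if pattern.match(word):
            return t
    return "UNK"                              # fallback for unknown words

for w in ["राम", "खाता", "15/08/2018", "10:30", "42", "क़लम"]:
    print(w, "->", tag(w))
```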

PPEditor: Semi-Automatic Annotation Tool for Korean Dependency Structure (PPEditor: 한국어 의존구조 부착을 위한 반자동 말뭉치 구축 도구)

  • Kim Jae-Hoon;Park Eun-Jin
    • The KIPS Transactions:PartB / v.13B no.1 s.104 / pp.63-70 / 2006
  • In general, a corpus contains a great deal of linguistic information and is widely used in natural language processing and computational linguistics. The creation of such a corpus, however, is expensive, labor-intensive, and time-consuming. To alleviate this problem, annotation tools for building corpora with rich linguistic information are indispensable. In this paper, we design and implement an annotation tool for building a Korean dependency tree-tagged corpus. The ideal would be to create the corpus fully automatically without annotators' intervention, but in practice this is impossible. The proposed tool is therefore semi-automatic, like most other annotation tools, and is designed for editing errors generated by basic analyzers such as a part-of-speech tagger and a (partial) parser. It is also designed to avoid repetitive work while editing errors and to be easy and friendly to use. Using the proposed annotation tool, 10,000 Korean sentences, each containing over 20 words, were annotated with dependency structures; eight annotators worked 4 hours a day for 2 months. We are confident that the tool yields accurate and consistent annotations while reducing labor and time.
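
As a sketch of the kind of data structure such a tool edits, the snippet below represents a parser-proposed dependency analysis and applies an annotator's correction. The field names and sample parse are hypothetical, not PPEditor's actual format.

```python
# Minimal sketch of semi-automatic dependency annotation: a parser proposes
# head indices and labels, and annotator edits are applied as corrections.

from dataclasses import dataclass

@dataclass
class Token:
    idx: int        # 1-based position in the sentence
    form: str       # surface eojeol
    head: int       # index of governing token (0 = root)
    label: str      # dependency label

# Analysis proposed by an automatic (partial) parser, possibly wrong.
sentence = [
    Token(1, "나는", 3, "SBJ"),
    Token(2, "학교에", 3, "ADV"),
    Token(3, "간다", 0, "ROOT"),
]

def edit_head(tokens, idx, new_head, new_label=None):
    """Annotator correction: re-attach token `idx` to `new_head`."""
    tok = tokens[idx - 1]
    tok.head = new_head
    if new_label is not None:
        tok.label = new_label

edit_head(sentence, 2, 3, "OBL")   # annotator fixes a mislabeled attachment
for t in sentence:
    print(t.idx, t.form, "->", t.head, t.label)
```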