• Title/Summary/Keyword: morpheme analyzer

Morphological Analysis with Adjacency Attributes and Phrase Dictionary (접속 특성과 말마디 사전을 이용한 형태소 분석)

  • Im, Gwon-Muk;Song, Man-Seok
    • The Transactions of the Korea Information Processing Society / v.1 no.1 / pp.129-139 / 1994
  • This paper presents a morphological analysis method for Korean. The characteristics and adjacency information of words can be obtained from sentences in a large corpus. In general, a word can be analyzed by applying adjacency attributes and rules; for ambiguous words, however, one analysis must be chosen from several candidates. The adjacency attributes of the collected morphemes and their relations with neighboring words are recorded in well-designed dictionaries. With this information, most abbreviated words as well as ambiguous words can be analyzed successfully. The efficiency of a morphological analyzer depends on the information in its dictionaries. A morpheme dictionary and a phrase dictionary have been designed with a lexical database, and the necessary information extracted from the corpus is stored in the dictionaries.

Syllable-Level Lightweight Korean POS Tagger using Transformer Encoder (트랜스포머 인코더를 활용한 음절 단위 경량화 형태소 분석기)

  • Suyoung Min;Youngjoong Ko
    • The Transactions of the Korea Information Processing Society / v.13 no.10 / pp.553-558 / 2024
  • Morphological analysis involves segmenting morphemes, the smallest units of meaning or grammatical function in a language, and assigning part-of-speech tags to each morpheme. It plays a critical role in various natural language processing tasks, such as named entity recognition and dependency parsing. Much of modern natural language processing relies on deep learning-based language models, and Korean morphological analysis can be broadly categorized into sequence-to-sequence methods and sequential labeling methods. This study proposes a morphological analysis approach using the transformer encoder for sequential labeling to perform syllable-level part-of-speech tagging, followed by morpheme restoration and tagging through a pre-analyzed dictionary. Additionally, the CBOW method was used to extract syllable-level embeddings in lower dimensions, designing a lightweight morphological analyzer model with reduced parameters. The proposed model achieves fast inference speed and low parameter usage, making it efficient for use in resource-constrained environments.
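The syllable-level labeling step described in this abstract can be illustrated with a small sketch. It assumes a B-/I- prefix scheme on the syllable tags (for example `B-NNG`, `I-NNG`); the abstract does not state the paper's actual tag scheme, and the dictionary-based morpheme restoration step is not covered. The function name `syllables_to_morphemes` is hypothetical.

```python
def syllables_to_morphemes(syllables, tags):
    """Merge syllable-level tags into (morpheme, POS) pairs.

    Assumption: each tag is "B-<POS>" (first syllable of a morpheme)
    or "I-<POS>" (continuation). This scheme is illustrative only.
    """
    morphemes = []
    for syllable, tag in zip(syllables, tags):
        prefix, pos = tag.split("-", 1)
        starts_new = (
            prefix == "B"
            or not morphemes
            or morphemes[-1][1] != pos  # inconsistent I- tag: start a new unit
        )
        if starts_new:
            morphemes.append([syllable, pos])
        else:
            morphemes[-1][0] += syllable
    return [(surface, pos) for surface, pos in morphemes]


# "학교에" ("to school"): a two-syllable noun followed by a one-syllable particle.
print(syllables_to_morphemes(list("학교에"), ["B-NNG", "I-NNG", "B-JKB"]))
```

Because every syllable receives exactly one label, decoding is a single linear pass, which fits the abstract's emphasis on fast inference.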

The Research Trend Analysis of the Korean Journal of Physical Education using Mecab-ko Morphology Analyzer (Mecab-ko 형태소 분석을 이용한 한국체육학회지 연구동향 분석)

  • Park, Sung-Geon;Kim, Wanseop;Lee, Dae-Taek
    • Korean Journal of Physical Education, Humanities and Social Sciences edition (한국체육학회지인문사회과학편) / v.56 no.6 / pp.595-605 / 2017
  • The purpose of this study is to investigate, using Mecab-ko morpheme analysis, which research fields are preferred by researchers of the Korean Physical Education Society and whether researchers' interests differ between the humanities and social sciences and the natural sciences. The data collected for this study comprise 5,014 papers published online in the Korean Journal of Physical Education from March 2002 to March 2017. We used the Mecab-ko morpheme analyzer to extract keywords from the collected documents. The study found that the number of papers published in KAHPERD appears to be decreasing. It also found that KAHPERD researchers' interest in leisure, live sports, and health was relatively higher than their interest in performance improvement. The populations that attracted the most research interest were women, the middle-aged, and the elderly. Researchers in the humanities and social sciences showed interest in both traditional research and social issues, while researchers in the natural sciences showed interest in deeper study of traditional research topics. In conclusion, to revitalize sports convergence research, it is necessary to establish standards for the field of study that focus on the depth and breadth of research.

A Convergence Study for Development of Psychological Language Analysis Program: Comparison of Existing Programs and Trend Analysis of Related Literature (심리학적 언어분석 프로그램 개발을 위한 융합연구: 기존 프로그램의 비교와 관련 문헌의 동향 분석)

  • Kim, Youngjun;Choi, Wonil;Kim, Tae Hoon
    • Journal of the Korea Convergence Society / v.12 no.11 / pp.1-18 / 2021
  • While content word-based frequency analysis has obvious limitations in handling intentional deception or irony, KLIWC has evolved toward function word analysis and KrKwic has evolved as a way to visualize co-occurrence frequencies. However, after more than 10 years of development, several issues still need improvement. Therefore, we set out to develop a new psychological language analysis program by analyzing KLIWC and KrKwic. First, the two programs were analyzed. In particular, the morpheme classifications of KLIWC and of Korean morpheme analyzers were compared to enhance function word analysis, and the psychological dictionaries were analyzed to strengthen psychological analysis. The analysis showed that the Hannanum part-of-speech analyzer had the most finely subdivided classification overall, but KLIWC was more finely subdivided for personal pronouns and KKMA for endings, suggesting that multiple part-of-speech analyzers should be used together to strengthen function word analysis. Second, the research trends of studies that analyzed texts with these programs were examined. The two programs were used in various academic fields, including Interdisciplinary Studies. In particular, KrKwic was used frequently for the analysis of papers and reports, and KLIWC was used frequently for comparative studies of writers' thoughts, emotions, and personality. Based on these results, the necessity and direction of development of a new psychological language analysis program are suggested.

Crawlers and Morphological Analyzers Utilize to Identify Personal Information Leaks on the Web System (크롤러와 형태소 분석기를 활용한 웹상 개인정보 유출 판별 시스템)

  • Lee, Hyeongseon;Park, Jaehee;Na, Cheolhun;Jung, Hoekyung
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2017.10a / pp.559-560 / 2017
  • Recently, as the problem of personal information leakage has emerged, studies on data collection and web document classification have been conducted. Existing systems judge only whether personal information is present, and unnecessary data is not filtered out because documents published under the same name or by the same user are not classified. In this paper, we propose a system that uses a crawler and a morphological analyzer to identify data types and homonyms in order to solve this problem. The user collects personal information on the web through the crawler. The collected data is classified through the morpheme analyzer, after which the leaked data can be confirmed. In addition, repeated use of the system yields more accurate results. It is expected that users will be provided with customized data.

A Character Identification Method using Postpositions for Animate Nouns in Korean Novels (한국어 소설에서 유정명사용 조사 기반의 인물 추출 기법)

  • Park, Taekeun;Kim, Seung-Hoon
    • Journal of Information Technology Services / v.15 no.3 / pp.115-125 / 2016
  • Novels include various character names, depending on the genre, the spatio-temporal background of the novel, and the nationality of the characters. Moreover, characters and their names are created by the author's pen and imagination. As a result, no proper noun dictionary can include all kinds of character names that have been or will be created by authors. In addition, since Korean has no capitalization, character names in Korean are harder to detect than those in English. Fortunately, however, Korean has postpositions, such as "-ege" and "hante", that are used with nouns denoting a sentient being or an animate object. We call such postpositions animate postpositions in this paper. In a previous study, the authors manually selected character names by referencing both Wikipedia and dictionaries of well-known people after applying a Korean morpheme analyzer, a proper noun dictionary, postpositions (e.g., "-ga", "-eun", "-neun", "-eui", and "-ege"), and titles (e.g., "buin"), in order to extract social networks from three novels translated into or written in Korean. However, that study did not report the precision, recall, or F-measure of character identification. In this paper, we evaluate the quantitative contribution of animate postpositions to character identification in novels in terms of precision, recall, and F-measure. The results show that animate postpositions are a valuable and powerful tool for character identification, without a proper noun dictionary, in novels translated into or written in Korean.
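The idea of using animate postpositions as a cue for character names can be sketched as follows. This is a minimal illustration, not the authors' tool: it uses plain suffix matching on whitespace-separated words instead of a morphological analyzer, and the function name `candidate_characters`, the punctuation set, and the sample sentences are all assumptions made for the example.

```python
from collections import Counter

# Hangul forms of the animate postpositions "-ege" and "hante" named in the abstract.
ANIMATE_POSTPOSITIONS = ("에게", "한테")


def candidate_characters(text, min_count=1):
    """Return words that appear with an animate postposition, most frequent first.

    Simplification: a real system would segment each word with a
    morphological analyzer rather than match raw suffixes.
    """
    counts = Counter()
    for raw in text.split():
        word = raw.strip(".,!?\"'")
        for postposition in ANIMATE_POSTPOSITIONS:
            if word.endswith(postposition) and len(word) > len(postposition):
                counts[word[: -len(postposition)]] += 1
                break
    return [name for name, n in counts.most_common() if n >= min_count]


text = "철수가 영희에게 편지를 썼다. 철수는 영희한테 책을 주었다. 선생님이 민수에게 물었다."
print(candidate_characters(text))
```

Raising `min_count` trades recall for precision, which is the kind of trade-off the abstract's precision, recall, and F-measure evaluation is concerned with.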

A Method for Clustering Noun Phrases into Coreferents for the Same Person in Novels Translated into Korean (한국어 번역 소설에서 인물명 명사구의 동일인물 공통참조 클러스터링 방법)

  • Park, Taekeun;Kim, Seung-Hoon
    • Journal of Korea Multimedia Society / v.20 no.3 / pp.533-542 / 2017
  • Novels include various character names, depending on the genre, the spatio-temporal background of the novel, and the nationality of the characters. Moreover, characters and their names are created by the author's pen and imagination. As a result, no proper noun dictionary can include all kinds of character names. In addition, novels translated into Korean have character names consisting of two or more nouns (such as "Harry Potter"). In this paper, we propose a method to extract noun phrases for character names and to cluster the noun phrases into coreferents for the same character. To extract noun phrases, we use the KKMA morpheme analyzer and the CPFoAN character identification tool. To cluster the noun phrases into coreferents, we construct a directed graph from the character names extracted by CPFoAN and the extracted noun phrases, and then create name sets for characters by traversing connected subgraphs in the directed graph. We conduct a survey on four novels translated into Korean to evaluate the proposed method. The results show that the proposed method will be useful for speaker identification as well as for constructing the social network of characters.
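The clustering step, building a graph over names and noun phrases and then collecting connected subgraphs, might look roughly like the sketch below. The abstract does not say how edges are defined, so the rule used here (a noun phrase is linked to every extracted name it contains as a word) is an assumption, as are the function name `cluster_names` and the sample data. Edges are stored in both directions so that traversal finds weakly connected components.

```python
from collections import defaultdict


def cluster_names(names, phrases):
    """Group names and noun phrases into connected components.

    Assumed edge rule: phrase -> name when the name is one of the
    phrase's words. Adjacency is kept symmetric for traversal.
    """
    graph = defaultdict(set)
    for name in names:
        graph[name]  # register isolated names as nodes
    for phrase in phrases:
        graph[phrase]
        words = phrase.split()
        for name in names:
            if name in words:
                graph[phrase].add(name)
                graph[name].add(phrase)

    seen, clusters = set(), []
    for start in list(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        clusters.append(component)
    return clusters


names = ["Harry", "Potter", "Hermione"]
phrases = ["Harry Potter", "Mr. Potter", "Hermione Granger"]
for cluster in cluster_names(names, phrases):
    print(sorted(cluster))
```

A known weakness of this simple edge rule is that a shared family name would merge different characters into one cluster, so a practical system would need additional constraints.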

A Study on the Development of a Practical Morphological Analysis System Based on Word Analysis (어절 분석 기반 형태소 분석 시스템 개발에 관한 연구)

  • 조현양;최성필;최재황
    • Journal of the Korean Society for Information Management / v.18 no.2 / pp.105-124 / 2001
  • The purpose of this study is to develop a Korean word analysis system, based on various methods of word analysis, that can improve the performance of IRS. We focused on maximizing the speed of Korean word analysis, modularizing each functional subsystem, and analyzing Korean morphemes precisely. The system developed in this study implements an optimized algorithm to increase the speed of word analysis and allows the speed and performance of each subsystem to be verified. In addition, numeral analysis based on numeral pattern recognition was implemented to reduce system load by avoiding recursive analysis of compound nouns.

Probabilistic Segmentation and Tagging of Unknown Words (확률 기반 미등록 단어 분리 및 태깅)

  • Kim, Bogyum;Lee, Jae Sung
    • Journal of KIISE / v.43 no.4 / pp.430-436 / 2016
  • Processing unknown words such as proper nouns and newly coined words is important for a morphological analyzer that must handle documents in various domains. In this study, a segmentation and tagging method for unknown Korean words is proposed for 3-step probabilistic morphological analysis. To guess unknown words, it uses the rich suffixes that attach to open-class words, such as general nouns and proper nouns. We propose a method to learn suffix patterns from a morpheme-tagged corpus and to calculate their probabilities for segmenting and tagging unknown open-class words in the probabilistic morphological analysis model. Experimental results showed that unknown word processing performance improves greatly on documents containing many unregistered words.
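A much simplified sketch of suffix-based guessing is given below. It is not the paper's 3-step model, whose details the abstract does not give: it only estimates P(tag | suffix) by relative frequency from (stem, tag, suffix) triples and picks the best split of an unknown word. The function names, the triple format, and the toy training data are all hypothetical.

```python
from collections import Counter, defaultdict


def learn_suffix_probs(tagged_words):
    """Estimate P(stem tag | suffix) from (stem, stem_tag, suffix) triples."""
    counts = defaultdict(Counter)
    for _stem, tag, suffix in tagged_words:
        counts[suffix][tag] += 1
    return {
        suffix: {tag: n / sum(tag_counts.values()) for tag, n in tag_counts.items()}
        for suffix, tag_counts in counts.items()
    }


def guess_unknown(word, probs):
    """Try every stem/suffix split; return (stem, tag, suffix, prob) or None."""
    best = None
    for i in range(1, len(word)):
        stem, suffix = word[:i], word[i:]
        for tag, p in probs.get(suffix, {}).items():
            if best is None or p > best[3]:
                best = (stem, tag, suffix, p)
    return best


# Toy "corpus": stems with their tags and the suffix (particle) that followed them.
corpus = [
    ("학교", "NNG", "에서"),
    ("서울", "NNP", "에서"),
    ("부산", "NNP", "에서"),
    ("철수", "NNP", "가"),
    ("사과", "NNG", "가"),
]
probs = learn_suffix_probs(corpus)
print(guess_unknown("제주에서", probs))
```

Only the suffix needs to be known here; the stem itself never has to appear in a dictionary, which is what makes the approach applicable to unregistered words.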

Development of Tourism Information Named Entity Recognition Datasets for the Fine-tune KoBERT-CRF Model

  • Jwa, Myeong-Cheol;Jwa, Jeong-Woo
    • International Journal of Internet, Broadcasting and Communication / v.14 no.2 / pp.55-62 / 2022
  • A smart tourism chatbot is needed as a user interface to efficiently provide tourists with smart tourism services such as recommended travel products, tourist information, personal travel itineraries, and tour guide services. We have developed a smart tourism app and a smart tourism information system that provide smart tourism services to tourists. We also developed a smart tourism chatbot service consisting of the khaiii morpheme analyzer, rule-based intention classification, and a tourism information knowledge base using the Neo4j graph database. In this paper, we develop the Korean and English smart tourism Named Entity (NE) datasets required to build an NER model using pre-trained language models (PLMs) for the smart tourism chatbot system. We create the tourism information NER datasets by collecting source data through the smart tourism app, the visitJeju website of the Jeju Tourism Organization (JTO), and web search, and by preprocessing it using Korean and English tourism information Named Entity dictionaries. We train the KoBERT-CRF NER model on the developed Korean and English tourism information NER datasets. The weight-averaged precision, recall, and F1 scores are 0.94, 0.92 and 0.94 on Korean and English tourism information NER datasets.
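The weight-averaged scores mentioned in this abstract are commonly computed by averaging per-label precision, recall, and F1 with each label weighted by its number of gold instances. The sketch below assumes token-level labels for simplicity (NER is often evaluated at the entity level instead), and the function name `weighted_prf` and the toy labels are illustrative, not taken from the paper.

```python
from collections import Counter


def weighted_prf(gold, pred):
    """Support-weighted average of per-label precision, recall, and F1."""
    support = Counter(gold)
    total = len(gold)
    w_prec = w_rec = w_f1 = 0.0
    for label, n in support.items():
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        weight = n / total  # share of gold instances carrying this label
        w_prec += weight * prec
        w_rec += weight * rec
        w_f1 += weight * f1
    return w_prec, w_rec, w_f1


gold = ["LOC", "LOC", "O", "O"]
pred = ["LOC", "O", "O", "O"]
precision, recall, f1 = weighted_prf(gold, pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Because F1 is averaged per label rather than recomputed from the averaged precision and recall, the weighted F1 does not have to lie between the weighted precision and recall.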