• Title/Summary/Keyword: Sematic role

Search results: 3

Consideration of Sematic Roles of Korean Subcategory in Computational Linguistics (전산언어학에서의 한국어 필수논항의 의미역 상정과 재고)

  • Kim, Yun-Jeong; Kim, Wan-Su; Ock, Cheol-Young
    • Language and Information / v.18 no.2 / pp.169-199 / 2014
  • This study proposes semantic roles for the obligatory arguments of predicates in Korean sentences and attaches the proposed roles to a real corpus. The study fixed the maximum number of semantic roles at 22 and set criteria for assigning them. It organized the semantic roles associated with case markers and attached semantic roles to the predicates of the sentences in the Standard Korean Dictionary. A program for attaching semantic roles, UTagger-SR, was developed; it incorporates a dictionary of case-marker semantic roles and a case frame dictionary. The tagging results showed that the most important semantic role in Korean sentences is the theme of the predicate, followed by the subject of the predicate.


Korean Sematic Role Labeling Using CRFs (CRFs 기반의 한국어 의미역 결정)

  • Park, Tae-Ho; Cha, Jeong-Won
    • Annual Conference on Human and Language Technology / 2015.10a / pp.11-14 / 2015
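  • Semantic role labeling is the problem of determining the semantic relations between a predicate and its arguments. Earlier work on semantic role labeling experimented with a variety of features, such as phrase structure information and dependency structure information. Arguments are strongly influenced by the predicate-argument relations that can be read off the syntactic structure, but the task is difficult because the meaning of an argument stays the same even when the syntactic structure changes. For Korean semantic role labeling, this paper uses the Korean PropBank corpus and a semantic role corpus built by the authors as training corpora. It verifies the performance of previously studied syntactic features and of additional features. A CRF was used to evaluate the proposed features; with the newly proposed features, the system achieved 76.25% (F1) on argument identification and classification.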
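
As a rough illustration of the kind of per-token features a CRF-based semantic role labeler might consume, the following sketch builds one feature dictionary per word from dependency information. It is not the authors' feature set: the `Token` fields, the feature names, the tag and relation labels, and the toy sentence are all hypothetical. A list of dictionaries per sentence is an input shape that CRF toolkits commonly accept.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    form: str    # surface form of the word
    pos: str     # part-of-speech tag (hypothetical tag strings)
    head: int    # index of the dependency head, -1 for the root
    deprel: str  # dependency relation label (hypothetical labels)


def token_features(sent: List[Token], i: int, pred: int) -> Dict[str, object]:
    """Build features for token i relative to the predicate at index pred."""
    tok = sent[i]
    if i < pred:
        position = "before"
    elif i > pred:
        position = "after"
    else:
        position = "pred"
    feats: Dict[str, object] = {
        "form": tok.form,
        "pos": tok.pos,
        "deprel": tok.deprel,
        "pred_form": sent[pred].form,
        "is_child_of_pred": tok.head == pred,  # direct dependent of the predicate
        "position": position,
        "distance": abs(i - pred),
    }
    # Context features from neighbouring tokens, with sentence-boundary flags.
    if i > 0:
        feats["prev_pos"] = sent[i - 1].pos
    else:
        feats["BOS"] = True
    if i < len(sent) - 1:
        feats["next_pos"] = sent[i + 1].pos
    else:
        feats["EOS"] = True
    return feats


def sentence_features(sent: List[Token], pred: int) -> List[Dict[str, object]]:
    return [token_features(sent, i, pred) for i in range(len(sent))]


# Toy sentence (made up for illustration): "철수가 책을 읽었다" (Cheolsu read a book).
sentence = [
    Token("철수가", "NNP+JKS", 2, "SBJ"),
    Token("책을", "NNG+JKO", 2, "OBJ"),
    Token("읽었다", "VV+EP+EF", -1, "ROOT"),
]
features = sentence_features(sentence, pred=2)
labels = ["ARG0", "ARG1", "O"]  # hypothetical PropBank-style gold labels
```

Each `(features, labels)` pair would form one training sequence. Making the features relative to the predicate (position, distance, direct dependency) is one way to let a sequence model use the syntactic cues the abstract mentions.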


Comparative Analysis of Vectorization Techniques in Electronic Medical Records Classification (의무 기록 문서 분류를 위한 자연어 처리에서 최적의 벡터화 방법에 대한 비교 분석)

  • Yoo, Sung Lim
    • Journal of Biomedical Engineering Research / v.43 no.2 / pp.109-115 / 2022
  • Purpose: Medical records classification using vectorization techniques plays an important role in natural language processing. The purpose of this study was to investigate suitable vectorization techniques for classifying electronic medical records. Materials and methods: A total of 403 electronic medical documents were extracted retrospectively and classified using cosine similarity calculated with Scikit-learn (a Python module for machine learning) in Jupyter Notebook. Vectors for the medical documents were produced by three different vectorization techniques (TF-IDF, latent semantic analysis (LSA) and Word2Vec), and the classification precision of each technique was evaluated. The Kruskal-Wallis test was used to determine whether there was a significant difference among the three techniques. Results: The 403 medical documents related to 41 different diseases, and the average number of documents per diagnosis was 9.83 (standard deviation=3.46). The classification precisions were 0.78 (TF-IDF), 0.87 (LSA) and 0.79 (Word2Vec). There was a statistically significant difference among the three vectorization techniques. Conclusions: The results suggest that removing irrelevant information (LSA) is a more efficient vectorization technique than modifying the weights of vectorization models (TF-IDF, Word2Vec) for medical document classification.
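
To make the TF-IDF plus cosine similarity step concrete, here is a minimal sketch in plain Python. The study used Scikit-learn, and the abstract does not say how similarity scores were turned into class labels. This sketch swaps in a dependency-free implementation and assumes a nearest-neighbour rule, where each document takes the label of its most similar other document. The toy documents and labels are invented for illustration; LSA and Word2Vec are not covered.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF, so every term keeps a positive weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(vectors: List[Dict[str, float]], labels: List[str], query: int) -> str:
    """Assumed rule: label of the most similar document other than the query."""
    others = [j for j in range(len(vectors)) if j != query]
    best = max(others, key=lambda j: cosine(vectors[query], vectors[j]))
    return labels[best]


# Invented toy corpus, not data from the study.
docs = [
    "chest pain shortness of breath".split(),
    "chest pain radiating to left arm".split(),
    "fever cough sore throat".split(),
    "cough fever runny nose".split(),
]
labels = ["cardiac", "cardiac", "respiratory", "respiratory"]

vectors = tfidf_vectors(docs)
predictions = [predict(vectors, labels, i) for i in range(len(docs))]
fraction_correct = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

On this toy corpus each document shares terms only with the other document of its class, so the leave-one-out predictions match the labels. Real medical records would need proper tokenization and a much larger sample.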