• Title/Summary/Keyword: Named Entity

Search Result 221, Processing Time 0.029 seconds

Construction of Test Collection for Automatically Extracting Technological Knowledge (기술 지식 자동 추출을 위한 테스트 컬렉션 구축)

  • Shin, Sung-Ho;Choi, Yun-Soo;Song, Sa-Kwang;Choi, Sung-Pil;Jung, Han-Min
    • The Journal of the Korea Contents Association
    • /
    • v.12 no.7
    • /
    • pp.463-472
    • /
    • 2012
  • For last decade, the amount of information has been increased rapidly because of the internet and computing technology development, mobile devices and sensors, and social networks like facebook or twitter. People who want to gain important knowledge from database have been frustrated with large database. Many studies for automatic knowledge extracting meaningful knowledge from large database have been fulfilled. In that sense, automatic knowledge extracting with computing technology has been highly significant in information technology field, but still has many challenges to go further. In order to improve the effectives and efficiency of knowledge extracting system, test collection is strongly necessary. In this research, we introduce a test collection for automatic knwoledge extracting. We name the test collection KEEC/KREC(KISTI Entity Extraction Collection/KISTI Relation Extraction Collection) and present the process and guideline for building as well as the features of. The main feature is to tag by experts to guarantee the quality of collection. The experts read documents and tag entities and relation between entities with a tool for tagging. KEEC/KREC is being used for a research to evaluate system performance and will continue to contribute to next researches.

A Study on Building Knowledge Base for Intelligent Battlefield Awareness Service

  • Jo, Se-Hyeon;Kim, Hack-Jun;Jin, So-Yeon;Lee, Woo-Sin
    • Journal of the Korea Society of Computer and Information
    • /
    • v.25 no.4
    • /
    • pp.11-17
    • /
    • 2020
  • In this paper, we propose a method to build a knowledge base based on natural language processing for intelligent battlefield awareness service. The current command and control system manages and utilizes the collected battlefield information and tactical data at a basic level such as registration, storage, and sharing, and information fusion and situation analysis by an analyst is performed. This is an analyst's temporal constraints and cognitive limitations, and generally only one interpretation is drawn, and biased thinking can be reflected. Therefore, it is essential to aware the battlefield situation of the command and control system and to establish the intellignet decision support system. To do this, it is necessary to build a knowledge base specialized in the command and control system and develop intelligent battlefield awareness services based on it. In this paper, among the entity names suggested in the exobrain corpus, which is the private data, the top 250 types of meaningful names were applied and the weapon system entity type was additionally identified to properly represent battlefield information. Based on this, we proposed a way to build a battlefield-aware knowledge base through mention extraction, cross-reference resolution, and relationship extraction.

Multilingual Named Entity Recognition with Limited Language Resources (제한된 언어 자원 환경에서의 다국어 개체명 인식)

  • Cheon, Min-Ah;Kim, Chang-Hyun;Park, Ho-min;Noh, Kyung-Mok;Kim, Jae-Hoon
    • Annual Conference on Human and Language Technology
    • /
    • 2017.10a
    • /
    • pp.143-146
    • /
    • 2017
  • 심층학습 모델 중 LSTM-CRF는 개체명 인식, 품사 태깅과 같은 sequence labeling에서 우수한 성능을 보이고 있다. 한국어 개체명 인식에 대해서도 LSTM-CRF 모델을 기본 골격으로 단어, 형태소, 자모음, 품사, 기구축 사전 정보 등 다양한 정보와 외부 자원을 활용하여 성능을 높이는 연구가 진행되고 있다. 그러나 이런 방법은 언어 자원과 성능이 좋은 자연어 처리 모듈(형태소 세그먼트, 품사 태거 등)이 없으면 사용할 수 없다. 본 논문에서는 LSTM-CRF와 최소한의 언어 자원을 사용하여 다국어에 대한 개체명 인식에 대한 성능을 평가한다. LSTM-CRF의 입력은 문자 기반의 n-gram 표상으로, 성능 평가에는 unigram 표상과 bigram 표상을 사용했다. 한국어, 일본어, 중국어에 대해 개체명 인식 성능 평가를 한 결과 한국어의 경우 bigram을 사용했을 때 78.54%의 성능을, 일본어와 중국어는 unigram을 사용했을 때 각 63.2%, 26.65%의 성능을 보였다.

  • PDF

Syllables-based Named Entity Extraction and Automatic Corpus Construction using Bidirectional Dynamic LST (Bidirectional Dynamic LSTM 을 이용한 음절 단위 개체명 추출 및 자동화된 말뭉치 구축)

  • Oh, Sungsik;Lim, Changdae;Ahn, Keeho;Park, Weijin
    • Annual Conference on Human and Language Technology
    • /
    • 2017.10a
    • /
    • pp.317-320
    • /
    • 2017
  • 개체명 인식은 자연어 문장에서 장소, 제작물, 사람 등 분류를 통한 의미 부여가 가능한 단어를 파악하는 기술로서 의미 분석을 위한 핵심 기술이다. 현재 많은 개체명 분석 관련 연구들은 형태소 분석 결과에 의존적인 형태를 갖고 있어서, 형태소 분석 결과의 정확성이 개체명 분석 결과의 성능에 영향을 미치고 있다. 본 연구에서는 형태소 분석 과정을 거치지 않는 음절 기반의 개체명 분석 기술을 제안하여 형태소 분석의 정확도가 낮은 통신어, 신조어 분석 성능을 향상하였다. 또한, 자동화된 방법으로 음절 단위 개체명 말뭉치 및 개체명 사전을 구축하는 프로세스를 정의하여 개체명 분석의 정확도 향상 및 인지 범주의 확대를 도모하였다. 본 연구에서 제안한 개체명 인식 기술은 한국어 개체명 표준에 기반한 129가지의 개체명 분류가 가능하며, 이는 자연어 처리 기술이 필요한 산업계에서 상용화하는데 큰 기여를 할 것으로 판단된다.

  • PDF

Korean Named Entity Recognition using Joint Learning with Language Model (언어 모델 다중 학습을 이용한 한국어 개체명 인식)

  • Kim, Byeong-Jae;Park, Chan-min;Choi, Yoon-Young;Kwon, Myeong-Joon;Seo, Jeong-Yeon
    • Annual Conference on Human and Language Technology
    • /
    • 2017.10a
    • /
    • pp.333-337
    • /
    • 2017
  • 본 논문에서는 개체명 인식과 언어 모델의 다중 학습을 이용한 한국어 개체명 인식 방법을 제안한다. 다중 학습은 1 개의 모델에서 2 개 이상의 작업을 동시에 분석하여 성능 향상을 기대할 수 있는 방법이지만, 이를 적용하기 위해서 말뭉치에 각 작업에 해당하는 태그가 부착되어야 하는 문제가 있다. 본 논문에서는 추가적인 태그 부착 없이 정보를 획득할 수 있는 언어 모델을 개체명 인식 작업과 결합하여 성능 향상을 이루고자 한다. 또한 단순한 형태소 입력의 한계를 극복하기 위해 입력 표상을 자소 및 형태소 품사의 임베딩으로 확장하였다. 기계 학습 방법은 순차적 레이블링에서 높은 성능을 제공하는 Bi-directional LSTM CRF 모델을 사용하였고, 실험 결과 언어 모델이 개체명 인식의 오류를 효과적으로 개선함을 확인하였다.

  • PDF

Heuristic-based Korean Coreference Resolution for Information Extraction

  • Euisok Chung;Soojong Lim;Yun, Bo-Hyun
    • Proceedings of the Korean Society for Language and Information Conference
    • /
    • 2002.02a
    • /
    • pp.50-58
    • /
    • 2002
  • The information extraction is to delimit in advance, as part of the specification of the task, the semantic range of the output and to filter information from large volumes of texts. The most representative word of the document is composed of named entities and pronouns. Therefore, it is important to resolve coreference in order to extract the meaningful information in information extraction. Coreference resolution is to find name entities co-referencing real-world entities in the documents. Results of coreference resolution are used for name entity detection and template generation. This paper presents the heuristic-based approach for coreference resolution in Korean. We constructed the heuristics expanded gradually by using the corpus and derived the salience factors of antecedents as the importance measure in Korean. Our approach consists of antecedents selection and antecedents weighting. We used three kinds of salience factors that are used to weight each antecedent of the anaphor. The experiment result shows 80% precision.

  • PDF

Vocabulary Coverage Improvement for Embedded Continuous Speech Recognition Using Knowledgebase (지식베이스를 이용한 임베디드용 연속음성인식의 어휘 적용률 개선)

  • Kim, Kwang-Ho;Lim, Min-Kyu;Kim, Ji-Hwan
    • MALSORI
    • /
    • v.68
    • /
    • pp.115-126
    • /
    • 2008
  • In this paper, we propose a vocabulary coverage improvement method for embedded continuous speech recognition (CSR) using knowledgebase. A vocabulary in CSR is normally derived from a word frequency list. Therefore, the vocabulary coverage is dependent on a corpus. In the previous research, we presented an improved way of vocabulary generation using part-of-speech (POS) tagged corpus. We analyzed all words paired with 101 among 152 POS tags and decided on a set of words which have to be included in vocabularies of any size. However, for the other 51 POS tags (e.g. nouns, verbs), the vocabulary inclusion of words paired with such POS tags are still based on word frequency counted on a corpus. In this paper, we propose a corpus independent word inclusion method for noun-, verb-, and named entity(NE)-related POS tags using knowledgebase. For noun-related POS tags, we generate synonym groups and analyze their relative importance using Google search. Then, we categorize verbs by lemma and analyze relative importance of each lemma from a pre-analyzed statistic for verbs. We determine the inclusion order of NEs through Google search. The proposed method shows better coverage for the test short message service (SMS) text corpus.

  • PDF

SVM-based Protein Name Recognition using Edit-Distance Features Boosted by Virtual Examples (가상 예제와 Edit-distance 자질을 이용한 SVM 기반의 단백질명 인식)

  • Yi, Eun-Ji;Lee, Gary-Geunbae;Park, Soo-Jun
    • Proceedings of the Korean Society for Bioinformatics Conference
    • /
    • 2003.10a
    • /
    • pp.95-100
    • /
    • 2003
  • In this paper, we propose solutions to resolve the problem of many spelling variants and the problem of lack of annotated corpus for training, which are two among the main difficulties in named entity recognition in biomedical domain. To resolve the problem of spotting valiants, we propose a use of edit-distance as a feature for SVM. And we propose a use of virtual examples to automatically expand the annotated corpus to resolve the lack-of-corpus problem. Using virtual examples, the annotated corpus can be extended in a fast, efficient and easy way. The experimental results show that the introduction of edit-distance produces some improvements in protein name recognition performance. And the model, which is trained with the corpus expanded by virtual examples, outperforms the model trained with the original corpus. According to the proposed methods, we finally achieve the performance 75.80 in F-measure(71.89% in precision,80.15% in recall) in the experiment of protein name recognition on GENIA corpus (ver.3.0).

  • PDF

Natural language processing techniques for bioinformatics

  • Tsujii, Jun-ichi
    • Proceedings of the Korean Society for Bioinformatics Conference
    • /
    • 2003.10a
    • /
    • pp.3-3
    • /
    • 2003
  • With biomedical literature expanding so rapidly, there is an urgent need to discover and organize knowledge extracted from texts. Although factual databases contain crucial information the overwhelming amount of new knowledge remains in textual form (e.g. MEDLINE). In addition, new terms are constantly coined as the relationships linking new genes, drugs, proteins etc. As the size of biomedical literature is expanding, more systems are applying a variety of methods to automate the process of knowledge acquisition and management. In my talk, I focus on the project, GENIA, of our group at the University of Tokyo, the objective of which is to construct an information extraction system of protein - protein interaction from abstracts of MEDLINE. The talk includes (1) Techniques we use fDr named entity recognition (1-a) SOHMM (Self-organized HMM) (1-b) Maximum Entropy Model (1-c) Lexicon-based Recognizer (2) Treatment of term variants and acronym finders (3) Event extraction using a full parser (4) Linguistic resources for text mining (GENIA corpus) (4-a) Semantic Tags (4-b) Structural Annotations (4-c) Co-reference tags (4-d) GENIA ontology I will also talk about possible extension of our work that links the findings of molecular biology with clinical findings, and claim that textual based or conceptual based biology would be a viable alternative to system biology that tends to emphasize the role of simulation models in bioinformatics.

  • PDF

Multilingual Named Entity Recognition with Limited Language Resources (제한된 언어 자원 환경에서의 다국어 개체명 인식)

  • Cheon, Min-Ah;Kim, Chang-Hyun;Park, Ho-min;Noh, Kyung-Mok;Kim, Jae-Hoon
    • 한국어정보학회:학술대회논문집
    • /
    • 2017.10a
    • /
    • pp.143-146
    • /
    • 2017
  • 심층학습 모델 중 LSTM-CRF는 개체명 인식, 품사 태깅과 같은 sequence labeling에서 우수한 성능을 보이고 있다. 한국어 개체명 인식에 대해서도 LSTM-CRF 모델을 기본 골격으로 단어, 형태소, 자모음, 품사, 기구축 사전 정보 등 다양한 정보와 외부 자원을 활용하여 성능을 높이는 연구가 진행되고 있다. 그러나 이런 방법은 언어 자원과 성능이 좋은 자연어 처리 모듈(형태소 세그먼트, 품사 태거 등)이 없으면 사용할 수 없다. 본 논문에서는 LSTM-CRF와 최소한의 언어 자원을 사용하여 다국어에 대한 개체명 인식에 대한 성능을 평가한다. LSTM-CRF의 입력은 문자 기반의 n-gram 표상으로, 성능 평가에는 unigram 표상과 bigram 표상을 사용했다. 한국어, 일본어, 중국어에 대해 개체명 인식 성능 평가를 한 결과 한국어의 경우 bigram을 사용했을 때 78.54%의 성능을, 일본어와 중국어는 unigram을 사용했을 때 각 63.2%, 26.65%의 성능을 보였다.

  • PDF