• Title/Summary/Keyword: NER(Named Entity Recognition)

Search Result 45, Processing Time 0.027 seconds

Performance Comparison Analysis on Named Entity Recognition system with Bi-LSTM based Multi-task Learning (다중작업학습 기법을 적용한 Bi-LSTM 개체명 인식 시스템 성능 비교 분석)

  • Kim, GyeongMin;Han, Seunggnyu;Oh, Dongsuk;Lim, HeuiSeok
    • Journal of Digital Convergence
    • /
    • v.17 no.12
    • /
    • pp.243-248
    • /
    • 2019
  • Multi-Task Learning(MTL) is a training method that trains a single neural network with multiple tasks influences each other. In this paper, we compare performance of MTL Named entity recognition(NER) model trained with Korean traditional culture corpus and other NER model. In training process, each Bi-LSTM layer of Part of speech tagging(POS-tagging) and NER are propagated from a Bi-LSTM layer to obtain the joint loss. As a result, the MTL based Bi-LSTM model shows 1.1%~4.6% performance improvement compared to single Bi-LSTM models.

Deep recurrent neural networks with word embeddings for Urdu named entity recognition

  • Khan, Wahab;Daud, Ali;Alotaibi, Fahd;Aljohani, Naif;Arafat, Sachi
    • ETRI Journal
    • /
    • v.42 no.1
    • /
    • pp.90-100
    • /
    • 2020
  • Named entity recognition (NER) continues to be an important task in natural language processing because it is featured as a subtask and/or subproblem in information extraction and machine translation. In Urdu language processing, it is a very difficult task. This paper proposes various deep recurrent neural network (DRNN) learning models with word embedding. Experimental results demonstrate that they improve upon current state-of-the-art NER approaches for Urdu. The DRRN models evaluated include forward and bidirectional extensions of the long short-term memory and back propagation through time approaches. The proposed models consider both language-dependent features, such as part-of-speech tags, and language-independent features, such as the "context windows" of words. The effectiveness of the DRNN models with word embedding for NER in Urdu is demonstrated using three datasets. The results reveal that the proposed approach significantly outperforms previous conditional random field and artificial neural network approaches. The best f-measure values achieved on the three benchmark datasets using the proposed deep learning approaches are 81.1%, 79.94%, and 63.21%, respectively.

HMM-based Korean Named Entity Recognition (HMM에 기반한 한국어 개체명 인식)

  • Hwang, Yi-Gyu;Yun, Bo-Hyun
    • The KIPS Transactions:PartB
    • /
    • v.10B no.2
    • /
    • pp.229-236
    • /
    • 2003
  • Named entity recognition is the process indispensable to question answering and information extraction systems. This paper presents an HMM based named entity (m) recognition method using the construction principles of compound words. In Korean, many named entities can be decomposed into more than one word. Moreover, there are contextual relationships among nouns in an NE, and among an NE and its surrounding words. In this paper, we classify words into a word as an NE in itself, a word in an NE, and/or a word adjacent to an n, and train an HMM based on NE-related word types and parts of speech. Proposed named entity recognition (NER) system uses trigram model of HMM for considering variable length of NEs. However, the trigram model of HMM has a serious data sparseness problem. In order to solve the problem, we use multi-level back-offs. Experimental results show that our NER system can achieve an F-measure of 87.6% in the economic articles.

Comparative study of text representation and learning for Persian named entity recognition

  • Pour, Mohammad Mahdi Abdollah;Momtazi, Saeedeh
    • ETRI Journal
    • /
    • v.44 no.5
    • /
    • pp.794-804
    • /
    • 2022
  • Transformer models have had a great impact on natural language processing (NLP) in recent years by realizing outstanding and efficient contextualized language models. Recent studies have used transformer-based language models for various NLP tasks, including Persian named entity recognition (NER). However, in complex tasks, for example, NER, it is difficult to determine which contextualized embedding will produce the best representation for the tasks. Considering the lack of comparative studies to investigate the use of different contextualized pretrained models with sequence modeling classifiers, we conducted a comparative study about using different classifiers and embedding models. In this paper, we use different transformer-based language models tuned with different classifiers, and we evaluate these models on the Persian NER task. We perform a comparative analysis to assess the impact of text representation and text classification methods on Persian NER performance. We train and evaluate the models on three different Persian NER datasets, that is, MoNa, Peyma, and Arman. Experimental results demonstrate that XLM-R with a linear layer and conditional random field (CRF) layer exhibited the best performance. This model achieved phrase-based F-measures of 70.04, 86.37, and 79.25 and word-based F scores of 78, 84.02, and 89.73 on the MoNa, Peyma, and Arman datasets, respectively. These results represent state-of-the-art performance on the Persian NER task.

Named entity recognition using transfer learning and small human- and meta-pseudo-labeled datasets

  • Kyoungman Bae;Joon-Ho Lim
    • ETRI Journal
    • /
    • v.46 no.1
    • /
    • pp.59-70
    • /
    • 2024
  • We introduce a high-performance named entity recognition (NER) model for written and spoken language. To overcome challenges related to labeled data scarcity and domain shifts, we use transfer learning to leverage our previously developed KorBERT as the base model. We also adopt a meta-pseudo-label method using a teacher/student framework with labeled and unlabeled data. Our model presents two modifications. First, the student model is updated with an average loss from both human- and pseudo-labeled data. Second, the influence of noisy pseudo-labeled data is mitigated by considering feedback scores and updating the teacher model only when below a threshold (0.0005). We achieve the target NER performance in the spoken language domain and improve that in the written language domain by proposing a straightforward rollback method that reverts to the best model based on scarce human-labeled data. Further improvement is achieved by adjusting the label vector weights in the named entity dictionary.

OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition

  • Larmande, Pierre;Liu, Yusha;Yao, Xinzhi;Xia, Jingbo
    • Genomics & Informatics
    • /
    • v.19 no.3
    • /
    • pp.27.1-27.4
    • /
    • 2021
  • Due to the rapid evolution of high-throughput technologies, a tremendous amount of data is being produced in the biological domain, which poses a challenging task for information extraction and natural language understanding. Biological named entity recognition (NER) and named entity normalisation (NEN) are two common tasks aiming at identifying and linking biologically important entities such as genes or gene products mentioned in the literature to biological databases. In this paper, we present an updated version of OryzaGP, a gene and protein dataset for rice species created to help natural language processing (NLP) tools in processing NER and NEN tasks. To create the dataset, we selected more than 15,000 abstracts associated with articles previously curated for rice genes. We developed four dictionaries of gene and protein names associated with database identifiers. We used these dictionaries to annotate the dataset. We also annotated the dataset using pretrained NLP models. Finally, we analysed the annotation results and discussed how to improve OryzaGP.

Improving classification of low-resource COVID-19 literature by using Named Entity Recognition

  • Lithgow-Serrano, Oscar;Cornelius, Joseph;Kanjirangat, Vani;Mendez-Cruz, Carlos-Francisco;Rinaldi, Fabio
    • Genomics & Informatics
    • /
    • v.19 no.3
    • /
    • pp.22.1-22.5
    • /
    • 2021
  • Automatic document classification for highly interrelated classes is a demanding task that becomes more challenging when there is little labeled data for training. Such is the case of the coronavirus disease 2019 (COVID-19) clinical repository-a repository of classified and translated academic articles related to COVID-19 and relevant to the clinical practice-where a 3-way classification scheme is being applied to COVID-19 literature. During the 7th Biomedical Linked Annotation Hackathon (BLAH7) hackathon, we performed experiments to explore the use of named-entity-recognition (NER) to improve the classification. We processed the literature with OntoGene's Biomedical Entity Recogniser (OGER) and used the resulting identified Named Entities (NE) and their links to major biological databases as extra input features for the classifier. We compared the results with a baseline model without the OGER extracted features. In these proof-of-concept experiments, we observed a clear gain on COVID-19 literature classification. In particular, NE's origin was useful to classify document types and NE's type for clinical specialties. Due to the limitations of the small dataset, we can only conclude that our results suggests that NER would benefit this classification task. In order to accurately estimate this benefit, further experiments with a larger dataset would be needed.

A Collaborative Framework for Discovering the Organizational Structure of Social Networks Using NER Based on NLP (NLP기반 NER을 이용해 소셜 네트워크의 조직 구조 탐색을 위한 협력 프레임 워크)

  • Elijorde, Frank I.;Yang, Hyun-Ho;Lee, Jae-Wan
    • Journal of Internet Computing and Services
    • /
    • v.13 no.2
    • /
    • pp.99-108
    • /
    • 2012
  • Many methods had been developed to improve the accuracy of extracting information from a vast amount of data. This paper combined a number of natural language processing methods such as NER (named entity recognition), sentence extraction, and part of speech tagging to carry out text analysis. The data source is comprised of texts obtained from the web using a domain-specific data extraction agent. A framework for the extraction of information from unstructured data was developed using the aforementioned natural language processing methods. We simulated the performance of our work in the extraction and analysis of texts for the detection of organizational structures. Simulation shows that our study outperformed other NER classifiers such as MUC and CoNLL on information extraction.

Improving spaCy dependency annotation and PoS tagging web service using independent NER services

  • Colic, Nico;Rinaldi, Fabio
    • Genomics & Informatics
    • /
    • v.17 no.2
    • /
    • pp.21.1-21.6
    • /
    • 2019
  • Dependency parsing is often used as a component in many text analysis pipelines. However, performance, especially in specialized domains, suffers from the presence of complex terminology. Our hypothesis is that including named entity annotations can improve the speed and quality of dependency parses. As part of BLAH5, we built a web service delivering improved dependency parses by taking into account named entity annotations obtained by third party services. Our evaluation shows improved results and better speed.

PharmacoNER Tagger: a deep learning-based tool for automatically finding chemicals and drugs in Spanish medical texts

  • Armengol-Estape, Jordi;Soares, Felipe;Marimon, Montserrat;Krallinger, Martin
    • Genomics & Informatics
    • /
    • v.17 no.2
    • /
    • pp.15.1-15.7
    • /
    • 2019
  • Automatically detecting mentions of pharmaceutical drugs and chemical substances is key for the subsequent extraction of relations of chemicals with other biomedical entities such as genes, proteins, diseases, adverse reactions or symptoms. The identification of drug mentions is also a prior step for complex event types such as drug dosage recognition, duration of medical treatments or drug repurposing. Formally, this task is known as named entity recognition (NER), meaning automatically identifying mentions of predefined entities of interest in running text. In the domain of medical texts, for chemical entity recognition (CER), techniques based on hand-crafted rules and graph-based models can provide adequate performance. In the recent years, the field of natural language processing has mainly pivoted to deep learning and state-of-the-art results for most tasks involving natural language are usually obtained with artificial neural networks. Competitive resources for drug name recognition in English medical texts are already available and heavily used, while for other languages such as Spanish these tools, although clearly needed were missing. In this work, we adapt an existing neural NER system, NeuroNER, to the particular domain of Spanish clinical case texts, and extend the neural network to be able to take into account additional features apart from the plain text. NeuroNER can be considered a competitive baseline system for Spanish drug and CER promoted by the Spanish national plan for the advancement of language technologies (Plan TL).