• Title/Summary/Keyword: Named entity classification

Development of Tourism Information Named Entity Recognition Datasets for the Fine-tune KoBERT-CRF Model

  • Jwa, Myeong-Cheol;Jwa, Jeong-Woo
    • International Journal of Internet, Broadcasting and Communication / v.14 no.2 / pp.55-62 / 2022
  • A smart tourism chatbot is needed as a user interface to efficiently provide tourists with smart tourism services such as recommended travel products, tourist information, my travel itinerary, and tour guide service. We have developed a smart tourism app and a smart tourism information system that provide these services to tourists. We also developed a smart tourism chatbot service consisting of the khaiii morpheme analyzer, rule-based intention classification, and a tourism information knowledge base built on the Neo4j graph database. In this paper, we develop the Korean and English smart tourism Named Entity (NE) datasets required to build an NER model using pre-trained language models (PLMs) for the smart tourism chatbot system. We create the tourism information NER datasets by collecting source data through the smart tourism app, the visitJeju website of the Jeju Tourism Organization (JTO), and web search, and by preprocessing it using Korean and English tourism information named entity dictionaries. We train the KoBERT-CRF NER model on the developed Korean and English tourism information NER datasets. The weighted-average precision, recall, and F1 scores are 0.94, 0.92, and 0.94 on the Korean and English tourism information NER datasets.
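
The weighted-average scores reported in this abstract can be illustrated with a minimal sketch. It assumes token-level tags and support-weighted averaging per label; the paper may instead score whole entity spans. The function name `weighted_prf` and the tag sequences are hypothetical and are not data from the paper.

```python
from collections import Counter

def weighted_prf(gold, pred):
    """Support-weighted precision, recall, and F1 over the labels in `gold`."""
    support = Counter(gold)          # number of gold tokens per label
    total = len(gold)
    p_sum = r_sum = f_sum = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        predicted = sum(1 for p in pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = n / total           # each label weighted by its support
        p_sum += weight * precision
        r_sum += weight * recall
        f_sum += weight * f1
    return p_sum, r_sum, f_sum

# Hypothetical token-level tags, not data from the paper.
gold = ["LOC", "LOC", "ORG", "O", "O", "O"]
pred = ["LOC", "O", "ORG", "O", "O", "LOC"]
precision, recall, f1 = weighted_prf(gold, pred)
```

Because the F1 score is averaged per label rather than derived from the averaged precision and recall, a weighted F1 does not have to lie between the weighted precision and the weighted recall.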

KorPatELECTRA : A Pre-trained Language Model for Korean Patent Literature to improve performance in the field of natural language processing(Korean Patent ELECTRA)

  • Jang, Ji-Mo;Min, Jae-Ok;Noh, Han-Sung
    • Journal of the Korea Society of Computer and Information / v.27 no.2 / pp.15-23 / 2022
  • In the field of patents, NLP (Natural Language Processing) is a challenging task due to the linguistic specificity of patent literature, so there is an urgent need for research on a language model optimized for Korean patent literature. Recently, in the field of NLP, there have been continuous attempts to build pre-trained language models for specific domains to improve performance on various tasks in those fields. Among them, ELECTRA is a pre-trained language model from Google that, following BERT, uses a new method called RTD (Replaced Token Detection) to increase training efficiency. This paper proposes KorPatELECTRA, which is pre-trained on a large amount of Korean patent literature data. In addition, pre-training was optimized by preprocessing the training corpus according to the characteristics of patent literature and by applying a patent vocabulary and tokenizer. To confirm its performance, KorPatELECTRA was tested on NER (Named Entity Recognition), MRC (Machine Reading Comprehension), and patent classification tasks using actual patent data, and it achieved the best performance on all three tasks compared with general-purpose language models.
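
As a rough illustration of the RTD objective named in this abstract, the sketch below builds the per-token targets a discriminator would be trained on: 1 where a token was replaced, 0 where it still matches the original. The function name `rtd_labels` and the example sentence are hypothetical; the generator that proposes replacements and the model itself are omitted.

```python
def rtd_labels(original_tokens, corrupted_tokens):
    """Return 1 where a token was replaced and 0 where it matches the original."""
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]

# Hypothetical example sentence, not taken from the patent corpus.
original = ["the", "device", "comprises", "a", "sensor"]
corrupted = ["the", "device", "contains", "a", "motor"]
labels = rtd_labels(original, corrupted)
```

Every position contributes a training signal under this objective, which is the usual explanation for the efficiency gain over predicting only masked positions.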

Multi-labeled Domain Detection Using CNN (CNN을 이용한 발화 주제 다중 분류)

  • Choi, Kyoungho;Kim, Kyungduk;Kim, Yonghe;Kang, Inho
    • Annual Conference on Human and Language Technology / 2017.10a / pp.56-59 / 2017
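  • We perform multi-label classification of utterance topics with a CNN (Convolutional Neural Network), using a multi-labeling method and a cluster method, and evaluate performance by applying MSE (Mean Square Error), softmax cross-entropy, and sigmoid cross-entropy to each method. The network takes as input a sequence tokenized at the syllable level with part-of-speech information added to each token, together with named entity information obtained from the Naver DB. In the experiments, the best performance, an F1 of 0.9873, was obtained when the problem was recast with the cluster method, sigmoid was used as the activation function of the output layer, and the network was trained with a cross-entropy cost function.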
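
The sigmoid cross-entropy loss mentioned in this abstract can be sketched as follows. This is a minimal illustration that assumes one independent sigmoid per topic label and a decision threshold of 0.5; the CNN that would produce the logits is omitted, and the function names, logits, and targets are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_cross_entropy(logits, targets):
    """Mean binary cross-entropy over independent per-label sigmoids."""
    losses = []
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses)

def predict_labels(logits, threshold=0.5):
    """Return the indices of every label whose probability reaches the threshold."""
    return [i for i, z in enumerate(logits) if sigmoid(z) >= threshold]

# Hypothetical logits for three topic labels; the utterance belongs to labels 0 and 2.
logits = [2.0, -1.0, 0.5]
targets = [1, 0, 1]
loss = sigmoid_cross_entropy(logits, targets)
labels = predict_labels(logits)
```

Unlike a softmax, the per-label sigmoids do not compete for probability mass, so several topics can be predicted for one utterance.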

Detecting and classification ADRs using Named Entity Recognition on social media (개체명 인식을 이용한 소셜 미디어에서의 약물 부작용 표현 추출 및 분류)

  • Jeong, Hyeon-jeong;Kim, Hyon Hee
    • Proceedings of the Korea Information Processing Society Conference / 2021.05a / pp.443-446 / 2021
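  • Safety information on medicines is collected and managed through reports of adverse drug events received online and offline. However, because reporting depends on voluntary participation by consumers, far fewer cases are recorded than the adverse drug reactions that actually occur. In this paper, we search for expressions of adverse drug reactions on social media to address this data sparsity problem. Because social media users often describe side effects in everyday natural language rather than in standard adverse drug reaction terms, we developed a model that extracts side effect expressions using named entity recognition. We also present a model that classifies the extracted expressions into standard terms. In the experiments, both proposed models achieved an accuracy of 0.9 or higher, showing that adverse drug reaction expressions written in natural language by ordinary users can be found effectively and mapped to standard adverse reaction terms.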

A Model for Minimum Price Search of Processed Food Items on Online Platforms Based on Quantity and Weight (온라인 가공식품의 수량과 중량에 따른 최저가격 검색 모델)

  • Tae-Min Choi;Heui-Seok Lim
    • Proceedings of the Korea Information Processing Society Conference / 2023.11a / pp.458-460 / 2023
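  • In the specific domain of processed food, it is difficult to search for the lowest price using only BM25, which is widely used in existing search engines. In this paper, to improve search accuracy beyond BM25, we propose a search system built by fine-tuning KoELECTRA, which is publicly available on HuggingFace, for named entity recognition (NER) and binary classification, and by linking these models with BM25. We verified its effectiveness through a performance evaluation against the existing BM25 baseline.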
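
To make the BM25 baseline concrete, here is a minimal sketch of Okapi BM25 scoring. It assumes whitespace tokenization, the common defaults k1 = 1.5 and b = 0.75, and a smoothed IDF variant; none of these settings come from the paper, and the product titles are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical product titles that differ only in pack quantity.
docs = [
    "instant noodles 120g x 5",
    "instant noodles 120g x 20",
    "canned tuna 150g x 3",
]
scores = bm25_scores("instant noodles 120g", docs)
```

The first two titles receive the same score because they differ only in the quantity token, which lexical matching does not interpret. This suggests why structured quantity and weight information, such as that extracted by an NER model, can help when ranking by unit price.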

Linguistic Features Discrimination for Social Issue Risk Classification (사회적 이슈 리스크 유형 분류를 위한 어휘 자질 선별)

  • Oh, Hyo-Jung;Yun, Bo-Hyun;Kim, Chan-Young
    • KIPS Transactions on Software and Data Engineering / v.5 no.11 / pp.541-548 / 2016
  • The use of social media is already essential as a source of information for listening to users' various opinions and for monitoring. We define social 'risks' as issues that negatively influence public opinion in social media. This paper aims to discriminate various linguistic features and reveal their effects in building an automatic classification model of social risks. In particular, we adopt a word embedding technique to represent linguistic clues in risk sentences. As a preliminary experiment to analyze the characteristics of individual features, we revise errors in automatic linguistic analysis. The results show that the most important feature is NE (Named Entity) information, and that the best condition combines basic linguistic features, word embedding, and word clusters within core predicates. Experimental results under real conditions in social big data, including linguistic analysis errors, show precision of 92.08% and 85.84% for the frequent risk categories set and the full test set, respectively.

Personal Information Detection and De-identification System using Sentence Intent Classification and Named Entity Recognition (문장 의도 분류와 개체명 인식을 활용한 개인정보 검출 및 비식별화 시스템)

  • Seo, Dong-Kuk;Kim, Gun-Woo;Kim, Jae-Young;Lee, Dong-Ho
    • Proceedings of the Korea Information Processing Society Conference / 2020.11a / pp.1018-1021 / 2020
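  • Recently, unstructured text documents containing personal information have been leaked or indiscriminately disclosed, harming not only the data subjects but also companies. Detecting and de-identifying personal information is essential before data can be released and used, but unlike structured data, unstructured data are difficult to process automatically. Previous studies have tried to automate this process with deep learning models, but they detected personal information by relying only on the named entity information of words, without considering the ambiguity of words within a sentence. Consequently, words in a sentence that should remain identifiable may also be de-identified, which can reduce the usefulness of the data. In this paper, we propose a personal information detection model that uses sentence intent information as additional input when learning the named entities of words, and a de-identification technique that takes the usefulness of personal information data into account.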

Dentinogenic Ghost Cell Tumor: A Case Report and Review of Literature (상아질성 유령세포종양: 증례보고와 문헌고찰)

  • Kim, Soung Min;Choi, So Young;Lee, Jae Il;Huh, Kyung Hoe;Myoung, Hoon;Lee, Jong Ho
    • Maxillofacial Plastic and Reconstructive Surgery / v.35 no.1 / pp.66-71 / 2013
  • Dentinogenic ghost cell tumor (DGCT) is a rare epithelial odontogenic neoplasm, representing 1.9% to 2.1% of all odontogenic tumors. It is the neoplastic counterpart of the calcifying odontogenic cyst (COC); its characteristic islands of odontogenic epithelial cells contain numerous ghost cells and dysplastic dentin, and it shares many histological features with ameloblastoma. The 2005 World Health Organization (WHO) Classification of Odontogenic Tumours re-named this entity as calcifying cystic odontogenic tumor (CCOT) and defined the clinico-pathological features of the ghost cell odontogenic tumours: CCOT, DGCT, and ghost cell odontogenic carcinoma (GCOC). We report a rare case of central DGCT in the posterior maxilla of a 31-year-old female, with a literature review, to emphasize the role of the oral and maxillofacial surgeon.

Development of the Rule-based Smart Tourism Chatbot using Neo4J graph database

  • Kim, Dong-Hyun;Im, Hyeon-Su;Hyeon, Jong-Heon;Jwa, Jeong-Woo
    • International Journal of Internet, Broadcasting and Communication / v.13 no.2 / pp.179-186 / 2021
  • We have developed the smart tourism app and Instagram and YouTube content to provide personalized tourism information and travel product information to individual tourists. In this paper, we develop a rule-based smart tourism chatbot with the khaiii (Kakao Hangul Analyzer III) morphological analyzer and the Neo4J graph database. To understand the intention of the user's question, the proposed chatbot system uses a morpheme analyzer, a proper noun dictionary that includes tourist destination names, and a general noun dictionary containing words frequently used in tourist information search. The tourism knowledge base built on the Neo4J graph database provides adequate answers to tourists' questions. The Neo4J nodes are Area, based on the tourist destination address; Contents, holding tourist information as properties; and Service, holding service attribute data frequently used for search. A Neo4J query is created from the analyzed intention of a tourist's question, using the properties of nodes and relationships in the Neo4J database. The answer is produced by searching the tourism knowledge base. We create the tourism knowledge base from more than 1300 Jeju tourism information items used in the smart tourism app. We plan to develop a multilingual smart tour chatbot using named entity recognition (NER), intention classification with conditional random fields (CRF), and transfer learning with pretrained language models.

Korean Word Sense Disambiguation using Dictionary and Corpus (사전과 말뭉치를 이용한 한국어 단어 중의성 해소)

  • Jeong, Hanjo;Park, Byeonghwa
    • Journal of Intelligence and Information Systems / v.21 no.1 / pp.1-13 / 2015
  • As opinion mining in big data applications has been highlighted, a lot of research on unstructured data has been carried out. Social media on the Internet generate unstructured or semi-structured data every second, often written in the natural human languages we use in daily life. Many words in human languages have multiple meanings or senses. As a result, it is very difficult for computers to extract useful information from these datasets. Traditional web search engines are usually based on keyword search, which yields incorrect results that are far from users' intentions. Even though a lot of progress has been made over the last years in enhancing the performance of search engines to provide users with appropriate results, there is still much room for improvement. Word sense disambiguation can play a very important role in natural language processing and is considered one of the most difficult problems in this area. Major approaches to word sense disambiguation can be classified as knowledge-based, supervised corpus-based, and unsupervised corpus-based approaches. This paper presents a method that automatically generates a corpus for word sense disambiguation by taking advantage of examples in existing dictionaries, avoiding expensive sense tagging processes. It evaluates the effectiveness of the method with a Naïve Bayes model, a supervised learning algorithm, using the Korean standard unabridged dictionary and the Sejong Corpus. The Korean standard unabridged dictionary has approximately 57,000 sentences. The Sejong Corpus has about 790,000 sentences tagged with both part-of-speech and senses. For the experiment of this study, the Korean standard unabridged dictionary and the Sejong Corpus were evaluated both in combination and as separate entities using cross validation. Only nouns, the target subjects in word sense disambiguation, were selected. 93,522 word senses among 265,655 nouns and 56,914 sentences from related proverbs and examples were additionally combined in the corpus. The Sejong Corpus was easily merged with the Korean standard unabridged dictionary because the Sejong Corpus was tagged based on sense indices defined by the Korean standard unabridged dictionary. Sense vectors were formed after the merged corpus was created. Terms used in creating sense vectors were added to the named entity dictionary of a Korean morphological analyzer. By using the extended named entity dictionary, term vectors were extracted from the input sentences. Given the extracted term vector and the sense vector model made during the pre-processing stage, the sense-tagged terms were determined by vector space model based word sense disambiguation. In addition, this study shows the effectiveness of the corpus merged from examples in the Korean standard unabridged dictionary and the Sejong Corpus. The experiment shows that better precision and recall are obtained with the merged corpus. This study suggests that the method can practically enhance the performance of internet search engines and help us to understand the meaning of a sentence more accurately in natural language processing pertinent to search engines, opinion mining, and text mining. The Naïve Bayes classifier used in this study is a supervised learning algorithm based on Bayes' theorem. The Naïve Bayes classifier assumes that all senses are independent. Even though this assumption is not realistic and ignores the correlation between attributes, the Naïve Bayes classifier is widely used because of its simplicity, and in practice it is known to be very effective in many applications such as text classification and medical diagnosis. However, further research needs to be carried out to consider all possible combinations and/or partial combinations of all senses in a sentence. Also, the effectiveness of word sense disambiguation may be improved if rhetorical structures or morphological dependencies between words are analyzed through syntactic analysis.
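
A Naïve Bayes sense classifier of the kind this abstract describes can be sketched as below. This is a minimal illustration that assumes a bag-of-words context, add-one (Laplace) smoothing, and conditional independence of context words given the sense. The English example sentences, sense labels, and function names are hypothetical and stand in for the dictionary examples and the Sejong Corpus.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (sense, context_words) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, words in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab

def disambiguate(model, context):
    """Return the sense with the highest log posterior for the context words."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)                      # log prior
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in context:
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Hypothetical sense-tagged examples for the ambiguous noun "bank".
examples = [
    ("bank/finance", ["deposit", "money", "account"]),
    ("bank/finance", ["loan", "interest", "money"]),
    ("bank/river", ["river", "water", "fishing"]),
    ("bank/river", ["muddy", "river", "shore"]),
]
model = train(examples)
sense = disambiguate(model, ["money", "loan"])
```

In this setup, dictionary example sentences would play the role of the `examples` list, which is how sense-tagged training data can be obtained without manual tagging.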