• Title/Summary/Keyword: Named Entity


A Named Entity Recognition Platform Based on Semi-Automatically Built NE-annotated Corpora and KoBERT (반자동구축된 개체명 주석코퍼스 DecoNAC과 KoBERT를 이용한 개체명인식 플랫폼 DecoNERO)

  • Kim, Shin-Woo;Hwang, Chang-Hoe;Yoon, Jeong-Woo;Lee, Seong-Hyeon;Choi, Soo-Won;Nam, Jee-Sun
    • Annual Conference on Human and Language Technology
    • /
    • 2020.10a
    • /
    • pp.304-309
    • /
    • 2020
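  • This study introduces DecoNERO, a named entity recognition platform. The platform relies on DecoNAC, an NE-annotated corpus built semi-automatically from the Korean electronic dictionary DECO (Dictionnaire Electronique du COreen) and a Local-Grammar Graph (LGG) framework that describes multi-word expression (MWE) named entities as local patterns. DecoNAC is used both for named entity analysis and as domain-specific training data for machine learning. Machine learning methods that have recently been reported to perform well require large-scale training data drawn from diverse domains. This study proposes a methodology for semi-automatically generating such training data from a carefully designed named entity dictionary and language resources for multi-word named entity sequences. To test the performance of the proposed DecoNAC-based approach, experiments were conducted on online news article text. The experiments confirmed that applying DecoNAC can be expected to improve performance by about 7.49% over recognizing named entities with the KoBERT model alone.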

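
The dictionary-driven annotation idea in the abstract above can be illustrated with a minimal sketch. This is not the DecoNAC/LGG implementation, which the abstract does not detail; it swaps in a plain longest-match lookup over a hypothetical toy lexicon (`LEXICON`) and emits BIO tags of the kind commonly used as NER training data. All names and entries are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: token sequences mapped to entity labels.
# A real resource such as DECO would be far larger and richer.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("Seoul",): "LOC",
    ("Korea",): "LOC",
    ("Bank", "of", "Korea"): "ORG",
}
MAX_LEN = max(len(key) for key in LEXICON)


def annotate(tokens: List[str]) -> List[str]:
    """Tag tokens with BIO labels using greedy longest-match lookup."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multi-word entities win.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            label = LEXICON.get(tuple(tokens[i:i + n]))
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


if __name__ == "__main__":
    sentence = ["The", "Bank", "of", "Korea", "is", "in", "Seoul"]
    for token, tag in zip(sentence, annotate(sentence)):
        print(token, tag)
```

Trying the longest span first keeps "Bank of Korea" from being tagged as a location through its last token. Output produced this way would still need manual review, which is what makes the process semi-automatic.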

Korean Word Sense Disambiguation using Dictionary and Corpus (사전과 말뭉치를 이용한 한국어 단어 중의성 해소)

  • Jeong, Hanjo;Park, Byeonghwa
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.1
    • /
    • pp.1-13
    • /
    • 2015
  • As opinion mining in big data applications has gained attention, a great deal of research on unstructured data has been conducted. Social media on the Internet generate unstructured or semi-structured data every second, often written in the natural languages we use in daily life. Many words in human languages have multiple meanings or senses. As a result, it is very difficult for computers to extract useful information from these datasets. Traditional web search engines are usually based on keyword search, which can return results far from users' intentions. Although much progress has been made in recent years in improving search engines so that they provide users with appropriate results, there is still considerable room for improvement. Word sense disambiguation can play a very important role in natural language processing and is considered one of the most difficult problems in this area. Major approaches to word sense disambiguation can be classified as knowledge-based, supervised corpus-based, and unsupervised corpus-based. This paper presents a method that automatically generates a corpus for word sense disambiguation by taking advantage of examples in existing dictionaries, avoiding expensive sense-tagging processes. It evaluates the effectiveness of the method with a Naïve Bayes model, a supervised learning algorithm, using the Korean standard unabridged dictionary and the Sejong Corpus. The Korean standard unabridged dictionary has approximately 57,000 sentences. The Sejong Corpus has about 790,000 sentences tagged with both part-of-speech and senses. In the experiments, the Korean standard unabridged dictionary and the Sejong Corpus were evaluated both in combination and as separate resources, using cross-validation. Only nouns, the targets of word sense disambiguation, were selected. In addition, 93,522 word senses among 265,655 nouns and 56,914 sentences from related proverbs and examples were combined into the corpus. The Sejong Corpus was easily merged with the Korean standard unabridged dictionary because the Sejong Corpus was tagged with the sense indices defined by that dictionary. Sense vectors were formed after the merged corpus was created. Terms used in creating the sense vectors were added to the named entity dictionary of a Korean morphological analyzer. Using the extended named entity dictionary, terms were extracted from the input sentences and term vectors for the sentences were then created. Given the extracted term vector and the sense vector model built during the pre-processing stage, the sense-tagged terms were determined by vector space model based word sense disambiguation. This study also shows the effectiveness of the corpus merged from the examples in the Korean standard unabridged dictionary and the Sejong Corpus: the experiments show better precision and recall with the merged corpus. The study suggests that the method can practically enhance the performance of Internet search engines and help to capture the meaning of a sentence more accurately in natural language processing tasks pertinent to search engines, opinion mining, and text mining. The Naïve Bayes classifier used in this study is a supervised learning algorithm based on Bayes' theorem. The Naïve Bayes classifier assumes that all senses are independent. Although this assumption is not realistic and ignores the correlation between attributes, the Naïve Bayes classifier is widely used because of its simplicity, and in practice it is known to be very effective in many applications such as text classification and medical diagnosis. However, further research needs to be carried out to consider all possible combinations and/or partial combinations of all senses in a sentence. Also, the effectiveness of word sense disambiguation may be improved if rhetorical structures or morphological dependencies between words are analyzed through syntactic analysis.
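
As a rough illustration of the supervised setup described in this abstract, the sketch below trains a Naïve Bayes sense classifier on dictionary-style example sentences. It is not the authors' system: the sense labels, the English toy examples, and the function names `train` and `predict` are all hypothetical, and add-one (Laplace) smoothing is an assumption the abstract does not mention.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Model = Tuple[Counter, Dict[str, Counter], Set[str]]


def train(examples: List[Tuple[str, List[str]]]) -> Model:
    """Count sense frequencies and per-sense context-word frequencies."""
    sense_counts: Counter = Counter()
    word_counts: Dict[str, Counter] = defaultdict(Counter)
    vocab: Set[str] = set()
    for sense, words in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab


def predict(model: Model, context: List[str]) -> str:
    """Return the sense with the highest log posterior for the context."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = "", -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in context:
            if word in vocab:  # words never seen in training are skipped
                # Add-one smoothing (an assumption of this sketch).
                score += math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Hypothetical dictionary examples for two senses of "bank".
EXAMPLES = [
    ("bank/finance", ["deposit", "money", "account"]),
    ("bank/finance", ["loan", "interest", "money"]),
    ("bank/river", ["river", "water", "fishing"]),
    ("bank/river", ["muddy", "river", "shore"]),
]

if __name__ == "__main__":
    model = train(EXAMPLES)
    print(predict(model, ["money", "loan"]))
    print(predict(model, ["river", "water"]))
```

Each dictionary example acts as a sense-labeled training sentence, which is how such a corpus can be generated without manual sense tagging.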

The Use of National Names for International Bodies of Water: Critical Perspective (공해(公海)에 대한 국가지명 사용: 비판적 관점)

  • 알렉산더B.머피
    • Journal of the Korean Geographical Society
    • /
    • v.34 no.5
    • /
    • pp.507-516
    • /
    • 1999
  • More than twenty-five major international bodies of water bear the names of particular nations or states. Many of these names are widely accepted, but considerable disagreement has developed in some cases. A systematic examination of the level of conflict over the use of national names for international bodies of water indicates that conflict is most likely to develop where shifting power relations among interested states produce concern about the hegemonic ambitions of the state after which the body of water is named. This is the case in the three situations where considerable contention exists over the use of a national name for an international body of water: the Persian Gulf/Arabian Sea, the Sea of Japan/East Sea, and the South China Sea/Bien Dong. Cases evidencing little contention are those where either no state has a significant interest in the naming issue, or where the name that is attached to the body of water is that of a state that has not been a historic threat to others in the region. Naming international bodies of water after nations or states is potentially problematic because such appellations can connote ownership or control by a single people or political entity. An understanding of the controversies surrounding these place names requires consideration of the geopolitical context in which they are embedded.


Provider Provisioned based Mobile VPN using Dynamic VPN Site Configuration (동적 VPN 사이트 구성을 이용한 Provider Provisioned 기반 모바일 VPN)

  • Byun, Hae-Sun;Lee, Mee-Jeong
    • Journal of KIISE:Information Networking
    • /
    • v.34 no.1
    • /
    • pp.1-15
    • /
    • 2007
  • The growing number of wireless mobile network users brings the issue of mobility management to Virtual Private Network (VPN) services. We propose a provider edge (PE)-based provider provisioned mobile VPN mechanism, which enables efficient communication between a mobile VPN user and one or more correspondents located in different VPN sites. The proposed mechanism not only minimizes the IPSec tunnel overhead at the mobile user node, but also enables traffic to be delivered through optimized paths among the (mobile) VPN users without incurring significant extra IPSec tunnel overhead, regardless of the users' locations. The proposed architecture and protocols are based on the BGP/MPLS VPN technology defined in RFC 2547. A service provider platform entity named PPVPN Network Server (PNS) is defined in order to extend the BGP/MPLS VPN service to mobile users. Compared to the user- and CE-based mobile VPN mechanisms, the proposed mechanism requires less overhead with respect to IPSec tunnel management. The simulation results also show that it outperforms the existing mobile VPN mechanisms with respect to the handoff latency and/or the end-to-end packet delay.

A CASE OF TYPE II MIRIZZI SYNDROME (Type II Mirizzi 증후군 1례)

  • Kim, Hong-Jin;Lee, Joo-Hyeong;Shin, Myeong-Jun;Kwun, Koing-Bo;Chang, Jae-Chun;Chung, Moon-Kwan
    • Journal of Yeungnam Medical Science
    • /
    • v.7 no.2
    • /
    • pp.197-202
    • /
    • 1990
  • Mechanical obstruction of the common hepatic duct has the following causes: choledocholithiasis, sclerosis, cholangitis, pancreatic carcinoma, cholangiocarcinoma, postoperative stricture, primary hepatic duct carcinoma, enlarged cystic duct lymph nodes, and metastatic nodal involvement of the porta hepatis. Partial mechanical obstruction of the common hepatic duct caused by impaction of stones and inflammation in the vicinity of the neck of the gallbladder was reported as the "syndrome del conducto hepatico" by Mirizzi in 1948. This condition is now called Mirizzi syndrome. Mirizzi syndrome is a rare form of common hepatic duct obstruction that results from an inflammatory response secondary to a gallstone impacted in the cystic duct or neck of the gallbladder. It arises when the cystic duct runs almost parallel to the common hepatic duct and inserts low into it. In a variant of Mirizzi syndrome, the cause of the common hepatic duct obstruction was a primary cystic duct carcinoma rather than gallstone disease. A 71-year-old man was admitted with a four-day history of right upper quadrant abdominal pain. Past medical history was unremarkable. On physical examination, the patient had a temperature of $38^{\circ}C$, icteric sclera, and right upper quadrant tenderness. Pertinent laboratory findings included WBC 18,000/$mm^3$; albumin 2.6 g/dl (normal 3.9-5.1); SGOT 183 u/L (normal 0-50); SGPT 167 u/L (normal 0-65); and bilirubin 8.2 mg/dl (normal 0-1), with direct bilirubin 4.4 mg/dl (normal 0-0.4). Ultrasonography revealed a dilated extrahepatic biliary tree. ERCP showed that the superior margin was angular and more consistent with a calculus causing partial CHD obstruction (Mirizzi syndrome). At surgery, a diseased gallbladder containing calculi was found. In addition, there were two calculi partially eroding through the proximal portion of the cystic duct and compressing the common hepatic duct. A cholecystectomy and excision of the common bile duct were performed, with Roux-en-Y hepaticojejunostomy. The postoperative course was uneventful.


Outdoor Healing Places Perception Analysis Using Named Entity Recognition of Social Media Big Data (소셜미디어 빅데이터의 개체명 인식을 활용한 옥외 힐링 장소 인식 분석)

  • Sung, Junghan;Lee, Kyungjin
    • Journal of the Korean Institute of Landscape Architecture
    • /
    • v.50 no.5
    • /
    • pp.90-102
    • /
    • 2022
  • In recent years, as interest in healing has increased, outdoor spaces built around the concept of healing have been created. To support more professional and in-depth planning and design, the perception and characteristics of outdoor healing places were analyzed by applying NER to social media posts. Text mining was conducted on 88,155 blog posts, followed by frequency analysis and clique cohesion analysis. Six elements were derived through a literature review, and two elements were added to analyze the perception and the characteristics of healing places. As a result, visitors considered place elements, date and time, social elements, and activity elements more important than personnel, psychological elements, plants and color, and form and shape when visiting healing places. The analysis allowed the perceptions and characteristics of healing places to be derived through keywords. In the clique analysis, keywords such as places, date and time, and relationship were clustered, so it was possible to know where, when, at what time, and with whom people were visiting places for healing. Through the study, the perception and characteristics of healing places were derived by analyzing large-scale data written by visitors. It was confirmed that specific elements could be used in planning and marketing.

Spatialization of Unstructured Document Information Using AI (AI를 활용한 비정형 문서정보의 공간정보화)

  • Sang-Won YOON;Jeong-Woo PARK;Kwang-Woo NAM
    • Journal of the Korean Association of Geographic Information Studies
    • /
    • v.26 no.3
    • /
    • pp.37-51
    • /
    • 2023
  • Spatial information is essential for interpreting urban phenomena. Methodologies for spatializing urban information, especially when it lacks location details, have been consistently developed. Typical methods include Geocoding using structured address information or place names, spatial integration with existing geospatial data, and manual tasks utilizing reference data. However, a vast number of documents produced by administrative agencies have not been deeply dealt with due to their unstructured nature, even when there's demand for spatialization. This research utilizes the natural language processing model BERT to spatialize public documents related to urban planning. It focuses on extracting sentence elements containing addresses from documents and converting them into structured data. The study used 18 years of urban planning public announcement documents as training data to train the BERT model and enhanced its performance by manually adjusting its hyperparameters. After training, the test results showed accuracy rates of 96.6% for classifying urban planning facilities, 98.5% for address recognition, and 93.1% for address cleaning. When mapping the result data on GIS, it was possible to effectively display the change history related to specific urban planning facilities. This research provides a deep understanding of the spatial context of urban planning documents, and it is hoped that through this, stakeholders can make more effective decisions.

Automatic Extraction of References for Research Reports using Deep Learning Language Model (딥러닝 언어 모델을 이용한 연구보고서의 참고문헌 자동추출 연구)

  • Yukyung Han;Wonsuk Choi;Minchul Lee
    • Journal of the Korean Society for Information Management
    • /
    • v.40 no.2
    • /
    • pp.115-135
    • /
    • 2023
  • The purpose of this study is to assess the effectiveness of using deep learning language models to extract references automatically and create a reference database for research reports in an efficient manner. Unlike academic journals, research reports present difficulties in automatically extracting references due to variations in formatting across institutions. In this study, we addressed this issue by introducing the task of separating references from non-reference phrases, in addition to the commonly used metadata extraction task for reference extraction. The study employed datasets that included various types of references, such as those from research reports of a particular institution, academic journals, and a combination of academic journal references and non-reference texts. Two deep learning language models, namely RoBERTa+CRF and ChatGPT, were compared to evaluate their performance in automatic extraction. They were used to extract metadata, categorize data types, and separate original text. The research findings showed that the deep learning language models were highly effective, achieving maximum F1-scores of 95.41% for metadata extraction and 98.91% for categorization of data types and separation of the original text. These results provide valuable insights into the use of deep learning language models and different types of datasets for constructing reference databases for research reports including both reference and non-reference texts.

Improving Bidirectional LSTM-CRF model Of Sequence Tagging by using Ontology knowledge based feature (온톨로지 지식 기반 특성치를 활용한 Bidirectional LSTM-CRF 모델의 시퀀스 태깅 성능 향상에 관한 연구)

  • Jin, Seunghee;Jang, Heewon;Kim, Wooju
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.1
    • /
    • pp.253-266
    • /
    • 2018
  • This paper proposes applying a sequence tagging methodology to improve the performance of NER (Named Entity Recognition) in a QA system. To retrieve the correct answers stored in a database, the user's query must be converted into a database language such as SQL (Structured Query Language) so that the computer can interpret the user's request. This requires identifying the class or data names in the database that the query refers to. The existing method, which looks up the words of the query in the database and recognizes objects from the matches, cannot identify homophones or multi-word phrases because it does not consider the context of the user's query. If there are multiple search results, all of them are returned, so the query admits many interpretations and the time complexity of the computation becomes large. To overcome these problems, this study reflects the contextual meaning of the query using a Bidirectional LSTM-CRF. We also address a weakness of neural network models, their inability to identify untrained words, by using an ontology knowledge-based feature. Experiments were conducted on an ontology knowledge base for the music domain, and performance was evaluated. To evaluate the proposed L-Bidirectional LSTM-CRF accurately, we converted words in the learned queries into untrained words and tested whether words that were included in the database but not seen in training were correctly identified. As a result, the model recognized objects in context and recognized untrained words without re-training the L-Bidirectional LSTM-CRF model, and overall object recognition performance was confirmed to improve.
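
One way to picture an ontology knowledge-based feature is as a per-token vector appended to the word embedding before it enters the sequence tagger. The sketch below shows only that input-construction step, not the Bidirectional LSTM-CRF itself. The abstract does not say how the feature is encoded, so the one-hot encoding, the toy music ontology (`ONTOLOGY`), the class list, and the two-dimensional embeddings are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical ontology classes and entries for a music domain.
CLASSES: List[str] = ["Song", "Artist", "Album"]
ONTOLOGY: Dict[str, str] = {
    "yesterday": "Song",
    "beatles": "Artist",
    "revolver": "Album",
}

# Hypothetical embeddings for words seen in training; anything else
# falls back to a shared unknown-word vector.
EMBEDDINGS: Dict[str, List[float]] = {
    "play": [0.2, 0.7],
    "by": [0.5, 0.1],
}
UNK: List[float] = [0.0, 0.0]


def ontology_feature(token: str) -> List[float]:
    """One-hot vector over ontology classes; all zeros if not found."""
    vec = [0.0] * len(CLASSES)
    cls = ONTOLOGY.get(token.lower())
    if cls is not None:
        vec[CLASSES.index(cls)] = 1.0
    return vec


def featurize(tokens: List[str]) -> List[List[float]]:
    """Concatenate each token's embedding with its ontology feature."""
    return [EMBEDDINGS.get(t.lower(), UNK) + ontology_feature(t) for t in tokens]


if __name__ == "__main__":
    for row in featurize(["play", "Yesterday", "by", "Beatles"]):
        print(row)
```

In this sketch "Yesterday" and "Beatles" have no trained embedding, yet their input vectors still differ through the ontology part, which is the kind of signal that could let a tagger handle untrained words without re-training.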