• Title/Abstract/Keywords: annotated corpus

46 search results, processing time 0.021 seconds

GNI Corpus Version 1.0: Annotated Full-Text Corpus of Genomics & Informatics to Support Biomedical Information Extraction

  • Oh, So-Yeon;Kim, Ji-Hyeon;Kim, Seo-Jin;Nam, Hee-Jo;Park, Hyun-Seok
    • Genomics & Informatics / Vol. 16, No. 3 / pp.75-77 / 2018
  • Genomics & Informatics (NLM title abbreviation: Genomics Inform) is the official journal of the Korea Genome Organization. A text corpus for this journal, annotated with various levels of linguistic information, would be a valuable resource, since information extraction requires syntactic, semantic, and higher levels of natural language processing. In this study, we publish our new corpus, GNI Corpus version 1.0, extracted and annotated from the full texts of Genomics & Informatics with an NLTK (Natural Language Toolkit)-based text mining script. This preliminary version of the corpus could be used as a training and testing set for a system that serves a variety of functions in future biomedical text mining.

Opinion: Strategy of Semi-Automatically Annotating a Full-Text Corpus of Genomics & Informatics

  • Park, Hyun-Seok
    • Genomics & Informatics / Vol. 16, No. 4 / pp.40.1-40.3 / 2018
  • There is a communal need for an annotated corpus consisting of the full texts of biomedical journal articles. In response to community needs, a prototype version of the full-text corpus of Genomics & Informatics, called GNI version 1.0, has recently been published, with 499 annotated full-text articles available as a corpus resource. However, GNI needs to be updated, as the texts were shallow-parsed and annotated with several existing parsers. I list issues associated with upgrading annotations and give an opinion on the methodology for developing the next version of the GNI corpus, based on a semi-automatic strategy for more linguistically rich corpus annotation.

Building an Annotated English-Vietnamese Parallel Corpus for Training Vietnamese-related NLPs

  • Dien Dinh;Kiem Hoang
    • The Institute of Electronics Engineers of Korea: Conference Proceedings / The Institute of Electronics Engineers of Korea, 2004 ICEIC The International Conference on Electronics Informations and Communications / pp.103-109 / 2004
  • In NLP (Natural Language Processing) tasks, the greatest difficulty computers face is the inherent ambiguity of natural languages. Earlier systems resolved this ambiguity with human-devised rules. Building a complete rule set is time-consuming and labor-intensive, yet it still does not cover all cases. Moreover, as the system grows, the rule set becomes very difficult to control. Many NLP tasks have therefore recently moved from rule-based approaches to corpus-based approaches that rely on large annotated corpora. Corpus-based NLP for widely used languages such as English and French has been well studied, with satisfactory results. In contrast, corpus-based NLP for Vietnamese is at a deadlock because annotated training data is lacking. Furthermore, hand-annotation of even reasonably well-determined features such as part-of-speech (POS) tags has proved labor-intensive and costly. In this paper, we present the construction of an annotated, aligned English-Vietnamese parallel corpus named EVC for training Vietnamese-related NLP tasks such as word segmentation, POS tagging, word-order transfer, word sense disambiguation, and English-to-Vietnamese machine translation.

SVM-based Protein Name Recognition using Edit-Distance Features Boosted by Virtual Examples

  • Yi, Eun-Ji;Lee, Gary-Geunbae;Park, Soo-Jun
    • Korean Society for Bioinformatics: Conference Proceedings / Korean Society for Bioinformatics and Systems Biology, Proceedings of the 2nd Annual Conference, 2003 / pp.95-100 / 2003
  • In this paper, we propose solutions to two of the main difficulties in named entity recognition in the biomedical domain: the many spelling variants of names and the lack of annotated corpora for training. To address spelling variants, we propose using edit distance as a feature for an SVM. To address the lack of training data, we propose using virtual examples to expand the annotated corpus automatically. With virtual examples, the annotated corpus can be extended quickly, efficiently, and easily. The experimental results show that introducing edit distance yields some improvement in protein name recognition performance, and that the model trained on the corpus expanded with virtual examples outperforms the model trained on the original corpus. With the proposed methods, we achieve an F-measure of 75.80 (71.89% precision, 80.15% recall) in protein name recognition on the GENIA corpus (ver. 3.0).
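
    The edit-distance feature and the reported scores can be illustrated with a short Python sketch. This is not the authors' implementation: the abstract does not say how the distance is normalized or which lexicon it is measured against, so the `LEXICON` entries, the length normalization, and all function names here are assumptions made for illustration. The sketch computes a plain Levenshtein distance, turns the distance to the closest lexicon entry into a numeric feature of the kind an SVM could consume, and checks that the reported precision and recall agree with the reported F-measure.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def min_normalized_distance(token: str, lexicon: list[str]) -> float:
    """Smallest length-normalized edit distance from token to any lexicon entry.

    The normalization (divide by the longer string) is an assumption.
    """
    token = token.lower()
    return min(
        edit_distance(token, name.lower()) / max(len(token), len(name))
        for name in lexicon
    )


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical lexicon of known protein names (illustration only).
LEXICON = ["NF-kappa B", "interleukin-2", "p53"]

# A spelling variant sits close to its lexicon entry.
print(min_normalized_distance("NF kappa B", LEXICON))
# Consistency check on the figures quoted in the abstract.
print(round(f_measure(71.89, 80.15), 2))
```

    A small distance marks a token as a likely spelling variant of a known name, which is the intuition behind using the value as a feature.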

A Study on the Diachronic Evolution of Ancient Chinese Vocabulary Based on a Large-Scale Rough Annotated Corpus

  • Yuan, Yiguo;Li, Bin
    • Asia Pacific Journal of Corpus Research / Vol. 2, No. 2 / pp.31-41 / 2021
  • This paper presents a quantitative analysis of the diachronic evolution of ancient Chinese vocabulary, based on constructing a large-scale, roughly annotated corpus and computing counts over it. The texts of Si Ku Quan Shu (a collection of ancient Chinese books) are automatically segmented to obtain ancient Chinese vocabulary with time information, which is then used to compute statistics on word frequency, standardized type/token ratio, and the proportions of monosyllabic and dissyllabic words. The data analysis yields four findings. First, the high-frequency words of ancient Chinese are stable to a certain extent. Second, there is no obvious trend toward dissyllabic words in ancient Chinese vocabulary. Third, the Northern and Southern Dynasties (420-589 AD) and the Yuan Dynasty (1271-1368 AD) are probably the two periods with the richest vocabulary in ancient Chinese. Finally, the high-frequency words unique to each dynasty are mainly official titles with real power. These findings break away from the qualitative methods used in traditional research on the history of the Chinese language and instead use quantitative methods to draw macroscopic conclusions from a large-scale corpus.
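
    The statistics named in this abstract can be sketched in Python. This is an illustrative sketch, not the authors' script: the chunk size, the handling of the trailing partial chunk, the one-character-equals-one-syllable shortcut, and the toy token list are all assumptions. A standardized type/token ratio averages the ratio over fixed-size chunks so that texts of different lengths stay comparable.

```python
def type_token_ratio(tokens):
    """Distinct word forms (types) divided by running words (tokens)."""
    return len(set(tokens)) / len(tokens)


def standardized_ttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive full chunks of chunk_size tokens.

    A trailing partial chunk is dropped (an assumption of this sketch).
    """
    ratios = [
        type_token_ratio(tokens[start:start + chunk_size])
        for start in range(0, len(tokens) - chunk_size + 1, chunk_size)
    ]
    if not ratios:
        raise ValueError("need at least chunk_size tokens")
    return sum(ratios) / len(ratios)


def monosyllabic_share(tokens):
    """Share of tokens written with one character (treated as one syllable)."""
    return sum(1 for t in tokens if len(t) == 1) / len(tokens)


# Toy segmented text (hypothetical, for illustration only).
tokens = ["天", "下", "天", "君子", "之", "道", "之", "君子"]

print(standardized_ttr(tokens, chunk_size=4))
print(monosyllabic_share(tokens))
```

    Computing these values per dynasty, rather than over the whole collection, is what would give the diachronic comparison described in the abstract.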

Semi-Automatic Annotation Tool to Build Large Dependency Tree-Tagged Corpus

  • Park, Eun-Jin;Kim, Jae-Hoon;Kim, Chang-Hyun;Kim, Young-Kill
    • Korean Society for Language and Information: Conference Proceedings / Korean Society for Language and Information, 2007 Regular Conference / pp.385-393 / 2007
  • Corpora annotated with rich linguistic information are required to develop robust statistical natural language processing systems. Building such corpora, however, is expensive, labor-intensive, and time-consuming. To support this work, we design and implement an annotation tool for building a Korean dependency tree-tagged corpus. Compared with other annotation tools, ours is characterized by independence from applications, localization of errors, powerful error checking, instant sharing of annotated information, and user-friendliness. Using our tool, we have annotated 100,904 Korean sentences with dependency structures. There were 33 annotators, the average annotation time was about 4 minutes per sentence, and the annotation took 5 months in total. We are confident that the tool yields accurate and consistent annotations while reducing labor and time.

Automatic Correction of Errors in Annotated Corpus Using Kernel Ripple-Down Rules

  • 박태호;차정원
    • Journal of KIISE / Vol. 43, No. 6 / pp.636-644 / 2016
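  • Training corpora for machine learning are very important in natural language processing. A large, clean corpus directly affects natural language processing systems. In this paper, we propose a new method for automatically correcting errors in large corpora. Correction rules that reflect the characteristics of human-tagged documents were generated automatically from an erroneous corpus and a gold-standard corpus. The correction rules were expressed using RDR (Ripple-Down Rules). To demonstrate the value of the correction method, we ran experiments on a part-of-speech tagged corpus and a named-entity tagged corpus, and obtained meaningful results in both. The method can be used to minimize errors when building large corpora.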
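
    For readers unfamiliar with Ripple-Down Rules, the following Python sketch shows how a single-classification RDR tree is evaluated. It is a generic illustration, not the paper's method: the abstract does not describe the "kernel" variant or how rules are learned, and the two tag-correction rules, the case fields, and the names `RDRNode`, `classify`, and `correct_tag` are hypothetical. Each node carries a condition and a conclusion; when the condition holds, the conclusion is adopted and the "except" child may override it, otherwise evaluation moves to the "else" child.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RDRNode:
    condition: Callable[[dict], bool]
    conclusion: str
    except_node: Optional["RDRNode"] = None  # refinement tried when condition holds
    else_node: Optional["RDRNode"] = None    # alternative tried when condition fails


def classify(node: Optional[RDRNode], case: dict, default: str) -> str:
    """Return the conclusion of the last rule satisfied along the evaluation path."""
    result = default
    while node is not None:
        if node.condition(case):
            result = node.conclusion
            node = node.except_node
        else:
            node = node.else_node
    return result


# Hypothetical correction rules for part-of-speech tags (illustration only):
# a modal tag right after a determiner is corrected to a noun,
# except when the next token is tagged as a verb.
ROOT = RDRNode(
    condition=lambda c: c["tag"] == "MD" and c["prev_tag"] == "DT",
    conclusion="NN",
    except_node=RDRNode(
        condition=lambda c: c["next_tag"] == "VB",
        conclusion="MD",
    ),
)


def correct_tag(case: dict) -> str:
    """Keep the original tag unless some rule in the tree fires."""
    return classify(ROOT, case, default=case["tag"])


print(correct_tag({"word": "can", "tag": "MD", "prev_tag": "DT", "next_tag": "IN"}))
```

    Because a new exception is attached only at the node where a wrong conclusion was reached, adding a rule does not change how previously handled cases are classified, which is what makes the structure attractive for incremental error correction.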

Lessons from Developing an Annotated Corpus of Patient Histories

  • Rost, Thomas Brox;Huseth, Ola;Nytro, Oystein;Grimsmo, Anders
    • Journal of Computing Science and Engineering / Vol. 2, No. 2 / pp.162-179 / 2008
  • We have developed a tool for annotating electronic health record (EHR) data. We are currently annotating a corpus of Norwegian general practitioners' EHRs by hand, mainly with linguistic information. The purpose of the project is to obtain a linguistically annotated corpus of patient histories from general practice, which will later be used in medical language processing and information extraction applications. The paper outlines some of our practical experiences from developing such a corpus and, in particular, the effects of semi-automated annotation. We have also run preliminary part-of-speech tagging experiments based on our corpus. The results indicate that relevant training data from the clinical domain gives better tagging results in this domain than training the tagger on a corpus from a more general domain. We plan to extend the corpus annotations with medical information at a later stage.

Building Korean Science Textbook Corpus (K-STeC) for research of Scientific Language in Education

  • 윤은정;김진호;남길임;송현주;옥철영;최준;박윤배
    • Journal of the Korean Association for Science Education / Vol. 38, No. 4 / pp.575-585 / 2018
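  • In this study, to enable systematic research on scientific language and scientific terminology, which have so far received little attention in science education, we gathered the science textbook texts of the past 20 years and built a science textbook corpus, creating a language resource that can be analyzed from multiple angles. As source material, we collected all science textbooks from elementary through high school under the 6th National Curriculum, the 7th National Curriculum, and the 2009 Revised Curriculum, and built the corpus from the 132 volumes issued by two publishers. The corpus was built in three stages: a raw corpus, a morphologically annotated corpus, and a term-annotated corpus. The final science textbook corpus was named K-STeC (Korea - Science Textbook Corpus). K-STeC is a semantically annotated corpus in which scientific terms are marked with sense distinctions and fields. It is marked up with meta-information comprising bibliographic information (curriculum, subject, grade, publisher), unit information (major unit, middle unit, sub-unit), location information (page, sentence number), and text-structure information (body text, inquiry activities, reference materials, titles). Over a research period of about three years, we combined the know-how of experts in three fields (linguistic informatics, computer engineering, and science education) to create a new research method, and a large number of specialists contributed to this labor-intensive result. This paper surveys the overall research procedure and methods in order to introduce the new methodology and its outcomes, and discusses the prospects for future research on scientific language and ways to use the results.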

Corpus-based analysis of the usage of Korean markers -(n)un and -i/ka in editorial texts

  • Kim, Kyoung-Young
    • Journal of the Korean Society for Language and Information: Language and Information / Vol. 19, No. 2 / pp.19-36 / 2015
  • The aim of this paper is to investigate the usage of Korean markers -(n)un and -i/ka in editorial texts focusing on information structure. Noun phrases ending with the markers -(n)un and -i/ka were annotated semi-automatically using a corpus obtained from an online newspaper. Two important factors to determine the choice of markers were examined with the annotated data: referential givenness/newness and position in a sentence. Referential givenness and newness were adopted as indicators of information structure, topic and focus respectively. In addition to quantitative analysis, qualitative analysis was conducted on the selected data. The results suggest that both the marker -(n)un and -i/ka could carry a topic and a focus reading. Sentence position also played a crucial role in determining the marker, and the marker -i/ka was used more frequently in a later position of a sentence than the marker -(n)un.
