• Title/Summary/Keyword: Novel Corpus


KONG-DB: Korean Novel Geo-name DB & Search and Visualization System Using Dictionary from the Web (KONG-DB: 웹 상의 어휘 사전을 활용한 한국 소설 지명 DB, 검색 및 시각화 시스템)

  • Park, Sung Hee
    • Journal of the Korean Society for Information Management
    • /
    • v.33 no.3
    • /
    • pp.321-343
    • /
    • 2016
  • This study aimed to design a semi-automatic, web-based pilot system that 1) builds a Korean novel geo-name database, 2) updates the database through automatic geo-name extraction so that it remains scalable, and 3) retrieves and visualizes the usage of old geo-names on a map. In particular, the problem of extracting novel geo-names that are now obsolete is difficult to solve because obtaining a corpus to use as training data is burdensome. To build such a corpus, an admin tool consisting of an HTML crawler and parser written in Python crawled geo-names and their usages from a vocabulary dictionary for the Korean New Novel, collecting enough data to train a named entity tagger that can extract even novel geo-names that do not appear in the training corpus. The training corpus and the automatic extraction tool made the geo-name database scalable. In addition, the system can visualize geo-names on a map. The study also designed and implemented the prototype and empirically verified the validity of the pilot system. Lastly, items to be improved are discussed.

Hereditary spastic paraplegia with thin corpus callosum due to novel homozygous mutation in SPG11 gene

  • Kang, Sa-Yoon;Kim, Joong Goo;Oh, Jung Hwhan
    • Annals of Clinical Neurophysiology
    • /
    • v.22 no.2
    • /
    • pp.121-124
    • /
    • 2020
  • The most common form of autosomal recessive hereditary spastic paraplegia (HSP) is caused by mutations in the SPG11/KIAA1840 gene, which encodes spatacsin. The clinical presentation of SPG11 is characterized by cognitive impairment, peripheral neuropathy, and a thin corpus callosum on brain magnetic resonance imaging. We identified a novel homozygous nonsense mutation (c.6082C>T [p.Q2028]) in exon 32 of SPG11 in Korean siblings. Our findings suggest that this novel homozygous mutation in SPG11 is associated with HSP and with dysgenesis of the corpus callosum.

A Novel Theory of Support in Social Media Discourse

  • Solomon, Bazil Stanley
    • Asia Pacific Journal of Corpus Research
    • /
    • v.1 no.1
    • /
    • pp.95-125
    • /
    • 2020
  • This paper aims to inform people how to support each other on social media. It alludes to an architecture for social media discourse and proposes a novel theory of support in that discourse. Its methodological contribution is to combine predominantly artificial intelligence methods with corpus linguistic analysis, applied to a large-scale dataset of anonymised diabetes-related users' posts from the Facebook platform. Log-likelihood and precision measures help with validation, and a multi-method approach with Discourse Analysis helps in understanding potential patterns. People living with diabetes are found to employ sophisticated high-frequency patterns of device-enabled categories of purpose and content, for example linguistic forms of Advice with stance-taking and targets such as Diabetes, amongst other interactional ways. There can be uncertainty and variation of effect displayed when sharing information for support. The implications of the new theory are aimed at healthcare communicators and corpus linguists, and offer preliminary work for AI support-bots. These bots may be programmed to use the language patterns to automatically support people who need them.

This study revises Lee Hyo-seok's The Buckwheat Season, utilizing Novel Corpus, intermediate learners' level (소설텍스트의 난이도 조정 방안 연구 -이효석의 「메밀꽃 필 무렵」을 중심으로-)

  • Hwang, Hye ran
    • Journal of Korean language education
    • /
    • v.29 no.4
    • /
    • pp.255-294
    • /
    • 2018
  • The Buckwheat Season, regarded as the best of Lee Hyo-seok's works, is one of the short stories that represent Korean literature. However, its vivid literary expressions, such as lyrical and beautiful depictions, figurative expressions, and dialects, which convey Korean beauty, instead cause learners difficulty and hinder reading comprehension. Thus, it is necessary to revise the text and present it in a form modified for the learners' language level. Methods of revising a literary text include revising linguistic elements such as cryptic vocabulary or sentence structure, and revising the composition of the text, e.g. suggesting characters or plot, or inserting illustrations. Methods of revising the language of the text can be divided into simplification and detailing. However, in the process of revising a text, many adapters depend on their own subjective perception rather than on objective criteria. This paper revised the text, utilizing by the Academy of Korean Studies, , and the by the National Institute of Korean Language to secure objectivity in revising the text.

Extracting Multiword Sentiment Expressions by Using a Domain-Specific Corpus and a Seed Lexicon

  • Lee, Kong-Joo;Kim, Jee-Eun;Yun, Bo-Hyun
    • ETRI Journal
    • /
    • v.35 no.5
    • /
    • pp.838-848
    • /
    • 2013
  • This paper presents a novel approach to automatically generate Korean multiword sentiment expressions by using a seed sentiment lexicon and a large-scale domain-specific corpus. A multiword sentiment expression consists of a seed sentiment word and its contextual words occurring adjacent to the seed word. The multiword sentiment expressions that are the focus of our study have a different polarity from that of the seed sentiment word. The automatically extracted multiword sentiment expressions show that 1) the contextual words should be defined as a part of a multiword sentiment expression in addition to their corresponding seed sentiment word, 2) the identified multiword sentiment expressions contain various indicators for polarity shift that have rarely been recognized before, and 3) the newly recognized shifters contribute to assigning a more accurate polarity value. The empirical result shows that the proposed approach achieves improved performance of the sentiment analysis system that uses an automatically generated lexicon.

Detecting Errors in POS-Tagged Corpus on XGBoost and Cross Validation (XGBoost와 교차검증을 이용한 품사부착말뭉치에서의 오류 탐지)

  • Choi, Min-Seok;Kim, Chang-Hyun;Park, Ho-Min;Cheon, Min-Ah;Yoon, Ho;Namgoong, Young;Kim, Jae-Kyun;Kim, Jae-Hoon
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.9 no.7
    • /
    • pp.221-228
    • /
    • 2020
  • A Part-of-Speech (POS) tagged corpus is a collection of electronic text in which each word is annotated with a tag for its corresponding POS, and it is widely used as training data for natural language processing. Training data is generally assumed to contain no errors, but in reality it includes various types of errors, which degrade the performance of systems trained on it. To alleviate this problem, we propose a novel method for detecting errors in an existing POS-tagged corpus using an XGBoost classifier and cross-validation as evaluation techniques. We first train a POS tagging classifier on the POS-tagged corpus, which contains some errors, and then detect errors in the POS-tagged corpus using cross-validation; however, the classifier by itself cannot detect errors because there is no training data for detecting POS tagging errors. We thus detect errors by comparing the outputs (POS probabilities) of the classifier while adjusting hyperparameters. The hyperparameters are estimated from a small-scale error-tagged corpus, whose text is sampled from the POS-tagged corpus and in which POS errors are marked up by experts. In this paper, we use recall and precision, which are widely used in information retrieval, as evaluation metrics. Because not all detected errors can be checked, we show that the proposed method is valid by comparing the distributions of the sample (the error-tagged corpus) and the population (the POS-tagged corpus). In the near future, we will apply the proposed method to a dependency tree-tagged corpus and a semantic role-tagged corpus.
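
The idea of flagging suspect tags from cross-validated classifier outputs might be sketched as follows. This is a minimal illustration and not the authors' implementation: a per-word tag-frequency estimate stands in for the XGBoost classifier, and the function name `flag_suspect_tags`, the fold assignment by token index, the toy corpus, and the `threshold` hyperparameter are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def flag_suspect_tags(tagged, k=4, threshold=0.5):
    """Return indices of tokens whose annotated tag looks unlikely.

    Each token is scored by a model trained only on the other folds
    (k-fold cross-validation). The "model" here is a simple estimate of
    P(tag | word) from counts, a stand-in for a real classifier.
    """
    suspects = []
    for fold in range(k):
        # Train on every token outside the held-out fold.
        counts = defaultdict(Counter)
        for i, (word, tag) in enumerate(tagged):
            if i % k != fold:
                counts[word][tag] += 1
        # Score the held-out tokens with the out-of-fold estimate.
        for i, (word, tag) in enumerate(tagged):
            if i % k != fold:
                continue
            total = sum(counts[word].values())
            if total == 0:
                continue  # word unseen in the training folds: no evidence
            if counts[word][tag] / total < threshold:
                suspects.append(i)
    return sorted(suspects)


# Toy corpus (hypothetical): the same sentence four times, with one
# deliberately wrong tag injected at index 7.
sentence = [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")]
corpus = sentence * 4
corpus[7] = ("dog", "VERB")

print(flag_suspect_tags(corpus))  # prints [7]
```

In this sketch `threshold` plays the role of a hyperparameter that would be tuned on a small expert-annotated sample, in the spirit of the error-tagged corpus the abstract mentions.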

Mining Parallel Text from the Web based on Sentence Alignment

  • Li, Bo;Liu, Juan;Zhu, Huili
    • Proceedings of the Korean Society for Language and Information Conference
    • /
    • 2007.11a
    • /
    • pp.285-292
    • /
    • 2007
  • The parallel corpus is an important resource in data-driven natural language processing research, but only a few parallel corpora are publicly available, mostly because of the heavy manual labor needed to construct this kind of resource. This paper proposes a novel strategy for automatically fetching parallel text from the web, which may help to address the lack of high-quality parallel corpora. The system we develop first downloads the web pages from certain hosts. Candidate parallel page pairs are then prepared from the page set based on the outer features of the web pages. In the last step, the candidate page pairs are evaluated: the sentences in each candidate pair are first extracted and aligned, and the similarity of the two web pages is then evaluated based on the similarities of the aligned sentences. Experiments on a multilingual web site show satisfactory performance of the system.


A Novel Text to Image Conversion Method Using Word2Vec and Generative Adversarial Networks

  • LIU, XINRUI;Joe, Inwhee
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2019.05a
    • /
    • pp.401-403
    • /
    • 2019
  • In this paper, we propose a generative adversarial network (GAN) based text-to-image generation method. In many natural language processing tasks, word representations are determined by their term frequency-inverse document frequency scores. Word2Vec is a type of neural network model that, given an unlabeled corpus, produces vectors expressing the semantics of the words in the corpus; an image is then generated by GAN training according to the obtained vector. Thanks to this understanding of the words, we can generate higher-quality and more realistic images. Our GAN structure is based on deep convolutional neural networks and pixel recurrent neural networks. Comparing the generated images with the real images, we obtain about 88% similarity on the Oxford-102 flowers dataset.

A Language Model Approach to "The Vegetarian" (채식주의자: 랭귀지 모델 접근)

  • Kim, Jaejun;Kwon, Junhyeok;Kim, Yoolae;Park, Myung-Kwan;Song, Sanghoun
    • Annual Conference on Human and Language Technology
    • /
    • 2017.10a
    • /
    • pp.260-263
    • /
    • 2017
  • This paper aims to broaden the spectrum of possible analyses of the Korean-language novel "The Vegetarian" by using computational linguistics tools. A language model, of the kind commonly used for bi-gram analysis in corpus linguistics, is applied to the Man Booker International Prize-winning novel, and the characteristics of "The Vegetarian" are investigated by comparing it with the English-language novel "A Little Life".
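
As a rough illustration of the kind of bi-gram language model mentioned above, the sketch below trains an add-one smoothed bigram model on one token sequence and compares perplexity on two texts. The abstract does not describe the model's details, so the smoothing choice, the function names (`train_bigram`, `bigram_prob`, `perplexity`), and the toy sentences are all assumptions for this example.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Count bigrams and the contexts (first words) they start from."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    return contexts, bigrams


def bigram_prob(prev, word, model, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(word | prev)."""
    contexts, bigrams = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)


def perplexity(tokens, model, vocab_size):
    """Perplexity of a token sequence under the bigram model."""
    pairs = list(zip(tokens, tokens[1:]))
    log_sum = sum(math.log(bigram_prob(p, w, model, vocab_size)) for p, w in pairs)
    return math.exp(-log_sum / len(pairs))


# Toy texts (hypothetical stand-ins for two novels).
text_a = "the cat sat on the mat".split()
text_b = "a dog ran in a park".split()
vocab = set(text_a) | set(text_b)

model_a = train_bigram(text_a)
print(perplexity(text_a, model_a, len(vocab)))  # lower: text the model was trained on
print(perplexity(text_b, model_a, len(vocab)))  # higher: unseen text
```

A lower perplexity means the model finds a text more predictable, which is one simple way two texts could be compared under the same model.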

