• Title/Abstract/Keywords: Vocabulary learning

186 search results (processing time: 0.023 seconds)

Improving methods for normalizing biomedical text entities with concepts from an ontology with (almost) no training data at BLAH5 the CONTES

  • Ferre, Arnaud;Ba, Mouhamadou;Bossy, Robert
    • Genomics & Informatics
    • /
    • Vol. 17, No. 2
    • /
    • pp.20.1-20.5
    • /
    • 2019
  • Entity normalization, or entity linking in the general domain, is an information extraction task that aims to annotate/bind multiple words/expressions in raw text with semantic references, such as concepts of an ontology. An ontology consists minimally of a formally organized vocabulary or hierarchy of terms, which captures knowledge of a domain. Presently, machine-learning methods, often coupled with distributional representations, achieve good performance. However, these require large training datasets, which are not always available, especially for tasks in specialized domains. CONTES (CONcept-TErm System) is a supervised method that addresses entity normalization with ontology concepts using small training datasets. CONTES has some limitations: it does not scale well to very large ontologies, it tends to overgeneralize predictions, and it lacks valid representations for out-of-vocabulary words. Here, we propose to assess different methods to reduce the dimensionality of the ontology representation. We also propose to calibrate parameters in order to make the predictions more accurate, and to address the problem of out-of-vocabulary words with a specific method.

A Study on the Automatic Descriptor Assignment for Scientific Journal Articles Using Rocchio Algorithm

  • 김판준
    • Journal of the Korean Society for Information Management (정보관리학회지)
    • /
    • Vol. 23, No. 3
    • /
    • pp.69-89
    • /
    • 2006
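  • We re-examined several performance factors that have been applied in Rocchio-based controlled-vocabulary automatic indexing and text categorization, and looked for basic ways to improve performance. We also compared, under equal conditions, the performance of the Rocchio-based method for controlled-vocabulary automatic indexing with that of other learning-based methods. According to the results, the Rocchio-based profile method retained its existing advantages of ease of implementation and economy of computer processing time, while performing nearly as well as or better than the other learning-based methods (SVM, VPT, NB). In particular, for semi-automatic indexing that supports the work of professional indexers, the Rocchio-based method can be considered first, because it maintains a relatively high level of recall while its precision improves greatly as the training data increase.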
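
The profile-based approach described in this abstract can be illustrated with a minimal Python sketch of a Rocchio-style classifier. This is not the paper's implementation: the whitespace tokenizer, the raw term-frequency weights, the `beta` and `gamma` values, and the toy documents and descriptors are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_rocchio(docs, beta=1.0, gamma=0.25):
    """Build one profile per descriptor.

    docs: list of (text, set_of_descriptors) pairs.
    Profile = beta * mean(positive docs) - gamma * mean(negative docs),
    keeping only terms whose final weight is positive.
    """
    vectors = [(tf_vector(text), labels) for text, labels in docs]
    all_labels = set().union(*(labels for _, labels in vectors))
    profiles = {}
    for label in all_labels:
        pos = [v for v, ls in vectors if label in ls]
        neg = [v for v, ls in vectors if label not in ls]
        profile = defaultdict(float)
        for v in pos:
            for term, weight in v.items():
                profile[term] += beta * weight / len(pos)
        for v in neg:
            for term, weight in v.items():
                profile[term] -= gamma * weight / len(neg)
        profiles[label] = {t: w for t, w in profile.items() if w > 0}
    return profiles


def rank_descriptors(profiles, text):
    """Return (similarity, descriptor) pairs, best match first."""
    vector = tf_vector(text)
    return sorted(((cosine(vector, p), label) for label, p in profiles.items()),
                  reverse=True)


# Toy training data (hypothetical documents and descriptors).
docs = [
    ("svm text classification kernel", {"machine learning"}),
    ("rocchio text classification profile", {"machine learning"}),
    ("library catalog indexing thesaurus", {"indexing"}),
    ("thesaurus descriptor indexing vocabulary", {"indexing"}),
]
profiles = train_rocchio(docs)
ranking = rank_descriptors(profiles, "automatic indexing with a thesaurus")
```

In a semi-automatic setting of the kind the abstract mentions, the ranked list would be shown to an indexer as suggestions rather than assigned directly.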

A Study on the Reclassification of Author Keywords for Automatic Assignment of Descriptors

  • 김판준;이재윤
    • Journal of the Korean Society for Information Management (정보관리학회지)
    • /
    • Vol. 29, No. 2
    • /
    • pp.225-246
    • /
    • 2012
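  • This study explored the possibility of automatically assigning descriptors (controlled keywords) by reclassifying the author keywords (uncontrolled keywords) provided in the search services of major Korean scholarly databases. First, we ran experiments comparing the characteristics of major machine-learning classifiers to select the best classifier and parameters for reclassification. Next, we trained on the author keywords assigned to Korean journal articles in the field of reading and reclassified those articles accordingly, thereby assigning additional keywords. We also examined whether the documents newly added by this reclassification show a vocabulary-control effect, gathering articles on the same topic in the way that descriptors (controlled keywords) do. The results confirmed that reclassifying author keywords can achieve the effect of automatically assigning descriptors.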

KOREAN CONSONANT RECOGNITION USING A MODIFIED LVQ2 METHOD

  • Makino, Shozo;Okimoto, Yoshiyuki;Kido, Ken'iti;Kim, Hoi-Rin;Lee, Yong-Ju
    • The Acoustical Society of Korea: Conference Proceedings (한국음향학회:학술대회논문집)
    • /
    • The Acoustical Society of Korea, 1994, FIFTH WESTERN PACIFIC REGIONAL ACOUSTICS CONFERENCE SEOUL KOREA
    • /
    • pp.1033-1038
    • /
    • 1994
  • This paper describes recognition results using the modified Learning Vector Quantization (MLVQ2) method which we proposed previously. First, we investigated the durations of the 29 Korean consonants and found that the variances of the durations were extremely large compared to other languages. We carried out preliminary recognition experiments for the three stop consonants P, T and K. From the recognition results, we determined the optimum conditions for learning. Then we applied the MLVQ2 method to the recognition of Korean consonants. The training was carried out using the phoneme samples in the 611-word vocabulary uttered by 2 male speakers, where each of the speakers uttered two repetitions. The recognition experiment was carried out for the phoneme samples in two repetitions of the 611-word vocabulary uttered by another male speaker. The recognition score for the twelve plosives was 68.2% for the test samples. The recognition score for the 29 Korean consonants was 64.8% for the test samples.
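
For readers unfamiliar with the LVQ family, the sketch below shows one update step of Kohonen's standard LVQ2.1 rule in Python. It is not the authors' modified MLVQ2 method, whose details the abstract does not give; the function name `lvq21_step`, the learning rate, the window width, and the two-dimensional toy vectors are assumptions.

```python
import math


def lvq21_step(codebook, x, label, lr=0.1, window=0.3):
    """Apply one standard LVQ2.1 update in place.

    codebook: list of [vector, class_label] pairs (vector is a list of floats).
    Returns True if the two nearest reference vectors were adjusted.
    """
    # Find the two reference vectors closest to the sample x.
    dists = sorted((math.dist(vec, x), idx) for idx, (vec, _) in enumerate(codebook))
    (d1, i), (d2, j) = dists[0], dists[1]
    class_i, class_j = codebook[i][1], codebook[j][1]

    # Exactly one of the two must carry the correct class.
    if (class_i == label) == (class_j == label):
        return False

    # The sample must lie inside a window around the midplane:
    # min(d1/d2, d2/d1) > (1 - w) / (1 + w). Since d1 <= d2, that is d1/d2.
    threshold = (1 - window) / (1 + window)
    ratio = d1 / d2 if d2 > 0 else 1.0
    if ratio <= threshold:
        return False

    good, bad = (i, j) if class_i == label else (j, i)
    # Pull the correct vector toward x and push the incorrect one away.
    codebook[good][0] = [m + lr * (a - m) for m, a in zip(codebook[good][0], x)]
    codebook[bad][0] = [m - lr * (a - m) for m, a in zip(codebook[bad][0], x)]
    return True


# Toy example: two reference vectors for hypothetical classes "p" and "t".
codebook = [[[0.0, 0.0], "p"], [[1.0, 0.0], "t"]]
updated = lvq21_step(codebook, [0.6, 0.0], "p")
```

The window test restricts updates to samples near the decision boundary between the two classes, which is where misclassifications occur.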

Examining the Effects of Vocabulary on Crowdfunding Success: A Comparison of Cultural and Commercial Campaigns

  • Xiang Gao;Weige Huang;Bin, Li;Sunghan Ryu
    • Asia pacific journal of information systems
    • /
    • Vol. 32, No. 2
    • /
    • pp.275-306
    • /
    • 2022
  • Crowdfunding has emerged as an important financing source for diverse cultural projects and commercial ventures in the early stages. Unlike traditional investment evaluation, where structured financial data is critical, such information is typically unavailable for crowdfunding campaigns. Instead, campaign creators prepare pitches containing essential information about themselves and the campaigns, which are crucial in attracting and persuading contributors. Prior literature has examined the effects of different aspects of campaign pitches, but a comprehensive understanding of the theme is lacking. This study aims to fill this gap by identifying the lexicon of frequently used vocabulary in campaign pitches and examining how it is associated with crowdfunding success. Moreover, we examine how the association differs between cultural and commercial crowdfunding campaigns. We randomly collected 50,000 campaigns from the cultural and commercial categories on Kickstarter and extracted the 100 most used verbs in the campaign pitches. Based on a machine learning approach combined with principal component analysis, we constructed sets of verbal factors statistically significant in predicting crowdfunding success. The findings also show that cultural and commercial campaigns consist of different verbal components with different effects on crowdfunding success.

A Machine Learning Approach to Korean Language Stemming

  • Cho, Se-hyeong
    • Journal of the Korean Institute of Intelligent Systems (한국지능시스템학회논문지)
    • /
    • Vol. 11, No. 6
    • /
    • pp.549-557
    • /
    • 2001
  • Morphological analysis and POS tagging require a dictionary for the language at hand. In this fashion, though, it is impossible to analyze a language without a dictionary. We also have difficulty if a significant portion of the vocabulary is new or unknown. This paper explores the possibility of learning the morphology of an agglutinative language, in particular Korean, without any prior lexical knowledge of the language. We use unsupervised learning in that there is no instructor to guide the outcome of the learner, nor any tagged corpus. Here are the main characteristics of the approach: First, we use only a raw corpus without any tags attached or any dictionary. Second, unlike many heuristics that are theoretically ungrounded, this method is based on statistical methods, which are widely accepted. The method is currently applied only to Korean, but since it is essentially language-neutral it can easily be adapted to other agglutinative languages.

A Computational Model of Language Learning Driven by Training Inputs

  • 이은석;이지훈;장병탁
    • Korean Society for Cognitive Science: Conference Proceedings (한국인지과학회:학술대회논문집)
    • /
    • Korean Society for Cognitive Science, 2010 Spring Conference
    • /
    • pp.60-65
    • /
    • 2010
  • Language learning involves the linguistic environment around the learner, so variation in the training input to which learners are exposed has been linked to their language learning. We explore how linguistic experiences can cause differences in learning linguistic structural features, as investigated in a probabilistic graphical model. We manipulate the amounts of training input, composed of natural linguistic data from animation videos for children, gradually from holistic (one-word expressions) to compositional (two- to six-word ones). The recognition and generation of sentences are a "probabilistic" constraint satisfaction process which is based on massively parallel DNA chemistry. Random sentence generation tasks succeed when networks begin with limited sentential lengths and vocabulary sizes and gradually expand to larger ones, like children's cognitive development in learning. This model supports the suggestion that variations in early linguistic environments with developmental steps may be useful for facilitating language acquisition.

Development of Context and Vocabulary Group-Based Intelligent English Vocabulary Learning System

  • 김도현;장홍준;김병욱
    • Korea Information Processing Society: Conference Proceedings (한국정보처리학회:학술대회논문집)
    • /
    • Korea Information Processing Society, 2023 Fall Conference
    • /
    • pp.19-20
    • /
    • 2023
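  • As the English education market expands, a variety of English learning systems are being developed. However, research on intelligent vocabulary learning systems that combine contextual understanding of vocabulary with effective learning methods remains scarce. In this study, we develop an intelligent English vocabulary learning system that presents n arbitrary English words as one group and provides an example sentence containing all of them. To this end, we developed a model that automatically generates a contextually appropriate English example sentence when given n arbitrary English words. Based on a vocabulary assessment, the system automatically selects weak vocabulary and guides learners to study those words. We conducted a survey to evaluate the usability of the system. The survey results show that the context and vocabulary group-based intelligent English learning system is convenient to use and can help improve vocabulary skills.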

Effective Acoustic Model Clustering via Decision Tree with Supervised Decision Tree Learning

  • Park, Jun-Ho;Ko, Han-Seok
    • Speech Sciences (음성과학)
    • /
    • Vol. 10, No. 1
    • /
    • pp.71-84
    • /
    • 2003
  • In acoustic modeling for large-vocabulary speech recognition, the sparse data problem caused by a huge number of context-dependent (CD) models usually makes the estimated models unreliable. In this paper, we develop a new clustering method based on the C4.5 decision-tree learning algorithm that effectively encapsulates the CD modeling. The proposed scheme essentially constructs a supervised decision rule and applies it over the pre-clustered triphones using the C4.5 algorithm, which is known to effectively search through the attributes of the training instances and extract the attribute that best separates the given examples. In particular, the data-driven method is used as a clustering algorithm while its result is used as the learning target of the C4.5 algorithm. This scheme has been shown to be effective, in terms of recognition performance, particularly over databases with a low unknown-context ratio. For speaker-independent, task-independent continuous speech recognition, the proposed method reduced the word error rate (WER) by 3.93% compared to the existing rule-based methods.
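
The attribute-selection step that the abstract attributes to the decision-tree learner can be sketched in Python with the gain-ratio criterion used by C4.5. This is a simplified illustration, not the paper's system: it handles only categorical attributes, omits C4.5's other heuristics, and the attribute names (`left`, `right`), context values, and cluster labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """Gain ratio of splitting on a categorical attribute.

    rows: list of dicts mapping attribute name -> value.
    labels: target class for each row (here: a pre-computed cluster id).
    """
    total = len(rows)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    # Information gain = entropy before the split minus weighted entropy after.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalizes attributes with many small partitions.
    split_info = -sum(len(g) / total * math.log2(len(g) / total)
                      for g in groups.values())
    return gain / split_info if split_info > 0 else 0.0


def best_attribute(rows, labels, attrs):
    """Pick the attribute with the highest gain ratio."""
    return max(attrs, key=lambda a: gain_ratio(rows, labels, a))


# Toy triphone contexts; the labels stand in for data-driven cluster ids.
rows = [
    {"left": "nasal", "right": "vowel"},
    {"left": "nasal", "right": "stop"},
    {"left": "stop", "right": "vowel"},
    {"left": "stop", "right": "stop"},
]
labels = ["c1", "c1", "c2", "c2"]
chosen = best_attribute(rows, labels, ["left", "right"])
```

Using the cluster ids as the learning target mirrors the abstract's description of treating the data-driven clustering result as the supervision signal for the tree.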

Evaluations of AI-based malicious PowerShell detection with feature optimizations

  • Song, Jihyeon;Kim, Jungtae;Choi, Sunoh;Kim, Jonghyun;Kim, Ikkyun
    • ETRI Journal
    • /
    • Vol. 43, No. 3
    • /
    • pp.549-560
    • /
    • 2021
  • Cyberattacks are often difficult to identify with traditional signature-based detection, because attackers continually find ways to bypass the detection methods. Therefore, researchers have introduced artificial intelligence (AI) technology for cybersecurity analysis to detect malicious PowerShell scripts. In this paper, we propose a feature optimization technique for AI-based approaches to enhance the accuracy of malicious PowerShell script detection. We statically analyze the PowerShell script and preprocess it with a method based on the tokens and abstract syntax tree (AST) for feature selection. Here, tokens and AST represent the vocabulary and structure of the PowerShell script, respectively. Performance evaluations with optimized features yield detection rates of 98% in both machine learning (ML) and deep learning (DL) experiments. Among them, the ML model with 3-grams of five selected tokens and the DL model based on AST 3-grams deliver the best performance.
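
The 3-gram features mentioned in this abstract can be illustrated with a short Python sketch that counts n-grams over a sequence of token types. It assumes the script has already been tokenized by some other tool; the token type names below are hypothetical placeholders, and the paper's actual token selection and AST processing are not reproduced here.

```python
from collections import Counter


def ngram_counts(tokens, n=3):
    """Count contiguous n-grams in a sequence of token types."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical token-type sequence for a short script.
tokens = ["Command", "Parameter", "String", "Command", "Parameter", "String"]
features = ngram_counts(tokens, n=3)
```

The resulting counts could serve as a sparse feature vector for a classifier, with each distinct 3-gram acting as one feature dimension.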