• Title/Summary/Keyword: 단어 선정 (word selection)

Search Result 222, Processing Time 0.028 seconds

Named Entity Tagged Corpus Augmentation Using Co-hyponym Replacement (형제어 대체를 이용한 개체명 말뭉치 확장)

  • Kim, Jae-Kyun;Kim, Chang-Hyun;Cheon, Min-Ah;Park, Hyuk-Ro;Kim, Jae-Hoon
    • Annual Conference on Human and Language Technology / 2020.10a / pp.179-183 / 2020
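  • Corpora are an essential resource for machine learning and deep learning. For Korean named entities, well-curated named-entity-tagged corpora for training are scarce, and curating a corpus is costly in both time and money. This paper therefore proposes a method for automatically augmenting a corpus from a small amount of data. Specifically, co-hyponyms are selected for the words in the sentences of a small corpus, and new sentences are generated by substituting co-hyponyms sampled according to their probabilities, which augments the corpus. Using the augmented corpus, we confirmed improved performance in most systems. In future work, a wider variety of sentences could be generated through additional operations such as word deletion and insertion.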
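
The replacement step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `CO_HYPONYMS` table, its weights, the tag names, and the `augment` function are all hypothetical, and the sketch assumes that only words carrying an entity tag are replaced.

```python
import random

# Hypothetical co-hyponym table: each word maps to sibling terms under the
# same hypernym, with made-up sampling weights (assumption, not paper data).
CO_HYPONYMS = {
    "Seoul": {"Busan": 0.6, "Incheon": 0.4},
    "Samsung": {"LG": 0.7, "Hyundai": 0.3},
}

def augment(tokens, tags, rng, n_new=3):
    """Create n_new sentences (duplicates possible) by swapping tagged
    entity words for co-hyponyms sampled by weight; tags are kept as-is."""
    results = []
    for _ in range(n_new):
        new_tokens = []
        for tok, tag in zip(tokens, tags):
            siblings = CO_HYPONYMS.get(tok)
            if tag != "O" and siblings:
                words = list(siblings)
                weights = [siblings[w] for w in words]
                # Weighted sampling of one replacement word.
                tok = rng.choices(words, weights=weights, k=1)[0]
            new_tokens.append(tok)
        results.append((new_tokens, list(tags)))
    return results

tokens = ["Samsung", "opened", "an", "office", "in", "Seoul"]
tags = ["ORG", "O", "O", "O", "O", "LOC"]
augmented = augment(tokens, tags, random.Random(0))
```

Because a co-hyponym shares its hypernym with the word it replaces, the sketch reuses the original tag sequence for every generated sentence.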

Development of Context and Vocabulary Group-Based Intelligent English Vocabulary Learning System (문맥 및 어휘 그룹 기반의 지능형 영어 어휘 학습 시스템의 개발)

  • Do-Hyeon Kim;Hong-Jun Jang;Byoungwook Kim
    • Proceedings of the Korea Information Processing Society Conference / 2023.11a / pp.19-20 / 2023
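  • As the English education market expands, a variety of English learning systems are being developed. However, little research has addressed intelligent vocabulary learning systems that combine contextual understanding of vocabulary with effective learning methods. In this study, we develop an intelligent English vocabulary learning system that presents an arbitrary set of n English words as a group and provides an example sentence containing all of them. To this end, we developed a model that automatically generates a contextually appropriate English example sentence when given n arbitrary English words. The system automatically selects a learner's weak vocabulary based on a vocabulary assessment and guides the learner to study those words. A survey was conducted to evaluate the usability of the system. The results show that the context- and vocabulary-group-based intelligent English learning system is convenient to use and can help improve vocabulary ability.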

Research trends in statistics for domestic and international journal using paper abstract data (초록데이터를 활용한 국내외 통계학 분야 연구동향)

  • Yang, Jong-Hoon;Kwak, Il-Youp
    • The Korean Journal of Applied Statistics / v.34 no.2 / pp.267-278 / 2021
  • The amount of data is growing across government and business, in Korea and abroad, and research on big data is growing in academia accordingly. Statistics is one of the major disciplines in big data research, and as the number of statistics papers grows, it is worthwhile to use that body of text to understand research trends in the field. In this study, we used the abstracts of statistics papers published in Korea and abroad to analyze which topics are being studied. Domestic and international research trends were analyzed through the frequency of the papers' keywords, and the relationships between keywords were visualized with word embedding. In addition to the author-selected keywords, important words in statistics papers, selected with TextRank, were also visualized. Lastly, 10 topics were extracted by applying LDA to the abstract data, and through the analysis of each topic we investigated which research topics are frequently studied and which words figure prominently.

Ontology-based Automated Metadata Generation Considering Semantic Ambiguity (의미 중의성을 고려한 온톨로지 기반 메타데이타의 자동 생성)

  • Choi, Jung-Hwa;Park, Young-Tack
    • Journal of KIISE:Software and Applications / v.33 no.11 / pp.986-998 / 2006
  • There is a growing need for Semantic Web-based metadata that helps computers efficiently understand and manage the information that has multiplied with the growth of the Internet. However, semantically ambiguous information is inevitably encountered when metadata is generated, and a solution to this problem is needed. This paper proposes a new method for automated metadata generation that uses a probability model of consecutive words to determine the class concept to which an ambiguous word embedded in information such as a document is semantically most related. We consider ambiguities among the concepts defined in an ontology and use the Hidden Markov Model to recognize parts of a named entity. First, we construct a Markov model for the named entities of each class defined in the ontology. Next, we generate an appropriate context from the text to understand the meaning of a semantically ambiguous word, and resolve the ambiguity during metadata generation by searching for the Markov model that best matches the sequence of words in the context. We experimented with seven semantically ambiguous words extracted from computer science theses. The results show that accuracy improved by about 18% compared with SemTag, which is known as an effective application for assigning a specific meaning to an ambiguous word based on its context.

A Study on the Deduction of Social Issues Applying Word Embedding: With an Empasis on News Articles related to the Disables (단어 임베딩(Word Embedding) 기법을 적용한 키워드 중심의 사회적 이슈 도출 연구: 장애인 관련 뉴스 기사를 중심으로)

  • Choi, Garam;Choi, Sung-Pil
    • Journal of the Korean Society for information Management / v.35 no.1 / pp.231-250 / 2018
  • In this paper, we propose a new methodology for extracting and formalizing subjective topics at a specific time using a set of keywords extracted automatically from online news articles. We first extracted a set of keywords by applying a TF-IDF method, chosen through a series of comparative experiments on statistical weighting schemes that measure the importance of individual words in a large set of texts. To calculate the semantic relations between the extracted keywords effectively, a set of word embedding vectors was built from about 1,000,000 separately collected news articles. The extracted keywords were represented as numerical vectors and clustered with the K-means algorithm. A qualitative in-depth analysis of the resulting keyword clusters found that most clusters were appropriate topics, with enough semantic concentration that labels could be assigned to them easily.
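
The keyword extraction step can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the weighting scheme the authors selected: it uses the plain `tf * log(N / df)` form of TF-IDF, and the function name `tfidf_keywords` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=2):
    """Return the top_k highest TF-IDF terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords

docs = [
    ["disability", "welfare", "policy", "news"],
    ["disability", "employment", "quota", "news"],
    ["transport", "access", "disability", "news"],
]
keywords = tfidf_keywords(docs)
```

Terms that occur in every document, such as "disability" and "news" here, score zero and are never selected, which is why TF-IDF favors words that distinguish one article from the rest.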

On the Development of a Large-Vocabulary Continuous Speech Recognition System for the Korean Language (대용량 한국어 연속음성인식 시스템 개발)

  • Choi, In-Jeong;Kwon, Oh-Wook;Park, Jong-Ryeal;Park, Yong-Kyu;Kim, Do-Yeong;Jeong, Ho-Young;Un, Chong-Kwan
    • The Journal of the Acoustical Society of Korea / v.14 no.5 / pp.44-50 / 1995
  • This paper describes a large-vocabulary continuous speech recognition system for the Korean language that uses continuous hidden Markov models. To improve the performance of the system, we study the selection of speech modeling units, inter-word modeling, the search algorithm, and grammars. Triphones are used as the basic speech modeling units, and generalized triphones and function-word-dependent phones are used to improve the trainability of speech units and to reduce errors in function words. Silence between words is optionally inserted by using a silence model and a null transition. A word-pair grammar and a bigram model based on word classes are used. We also implement a search algorithm that finds the N-best candidate sentences. A postprocessor reorders the N-best sentences using a word-triple grammar, selects the most likely sentence as the final recognition result, and finally corrects trivial errors related to postpositions. In recognition tests using a 3,000-word continuous speech database, the system attained 93.1% word recognition accuracy and 73.8% sentence recognition accuracy using the word-triple grammar in postprocessing.

A Word Dictionary Structure for the Postprocessing of Hangul Recognition (한글인식 후처리용 단어사전의 기억구조)

  • ;Yoshinao Aoki
    • The Journal of Korean Institute of Communications and Information Sciences / v.19 no.9 / pp.1702-1709 / 1994
  • In the postprocessing stage of a Hangul recognition system, the storage structure for contextual information strongly affects the recognition rate and speed of the entire system. A trie is generally used to represent the context as a word dictionary, but its memory space efficiency is low. We therefore propose a new word dictionary structure that has better space efficiency while retaining the merits of a trie. Because Hangul is a compound language, it can be represented either by phonemes or by characters. In the representation by phonemes (P-mode), retrieval is fast but space efficiency is low. In the representation by characters (C-mode), space efficiency is high but retrieval is slow. In this paper, the two representations are combined into a hybrid representation (H-mode). First, an optimal level for the combination is selected from two characteristic curves, node utilization and dispersion. The input words are then represented as a trie in P-mode from the first level to the optimal level, and the remainder is represented as a sequentially linked list in C-mode. Experimental results on six word sets show that the proposed structure is more efficient: retrieval in H-mode is as fast as in P-mode, and space efficiency is as good as in C-mode.
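
The H-mode idea (a phoneme-level trie down to a chosen level, then a sequentially scanned list of character-level remainders) can be sketched as follows. This is a simplified illustration, not the structure from the paper: the class name `HybridDict`, the choice to count the split level in syllables, and the use of a Python list in place of a linked list are all assumptions. The jamo decomposition follows the standard Unicode arithmetic for precomposed Hangul syllables.

```python
def to_jamo(syllable):
    """Split one precomposed Hangul syllable into (initial, medial, final) indices."""
    code = ord(syllable) - 0xAC00
    if not 0 <= code < 11172:
        raise ValueError(f"not a Hangul syllable: {syllable!r}")
    return (code // 588, (code % 588) // 28, code % 28)

class HybridDict:
    """Trie over phonemes for the first split_level syllables (P-mode part),
    then a plain list of character remainders (stand-in for the C-mode list)."""

    def __init__(self, split_level):
        self.split_level = split_level
        self.root = {"children": {}, "tails": []}

    def _node_for(self, word, create):
        node = self.root
        for syllable in word[: self.split_level]:
            for jamo in to_jamo(syllable):
                child = node["children"].get(jamo)
                if child is None:
                    if not create:
                        return None
                    child = {"children": {}, "tails": []}
                    node["children"][jamo] = child
                node = child
        return node

    def add(self, word):
        node = self._node_for(word, create=True)
        tail = word[self.split_level:]
        if tail not in node["tails"]:
            node["tails"].append(tail)

    def __contains__(self, word):
        node = self._node_for(word, create=False)
        # Sequential scan over the remainders stored at this trie node.
        return node is not None and word[self.split_level:] in node["tails"]

d = HybridDict(split_level=1)
for w in ["한글", "한국", "학교"]:
    d.add(w)
```

With `split_level=1`, "한글" and "한국" share one trie path for "한" and differ only in the remainders stored at its end, so a larger split level buys faster lookups at the cost of more trie nodes.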

Phonological retrieval and phonological memory skills in children with dyslexia and poor comprehension (난독증 아동과 읽기이해부진 아동의 음운인출과 음운기억 능력)

  • Hyojin Yoon
    • Phonetics and Speech Sciences / v.16 no.2 / pp.83-90 / 2024
  • This study explored phonological retrieval and phonological memory skills in second- and third-graders with dyslexia, poor comprehension, and typical development. The participants were 17 children with dyslexia, 17 children with poor comprehension, and 24 typically developing children. Children with dyslexia scored below 85 on the word decoding test; poor comprehenders scored above 90 on the word decoding test and below 85 on the reading comprehension test; typically developing children scored above 90 on both reading tests. All participants were assessed on rapid automatized naming (RAN) and nonword repetition (NWR). The results indicated that children with dyslexia performed significantly worse on the RAN and NWR tasks than the other groups. However, there were no significant differences between poor comprehenders and typically developing children. Furthermore, only RAN was significantly correlated with word decoding and reading comprehension in children with dyslexia. For typically developing children, RAN was correlated with word decoding and reading comprehension, while NWR was significantly correlated with reading comprehension. No correlations were found between these variables for poor comprehenders. The findings suggest that children with dyslexia have difficulties with phonological retrieval and phonological memory, which are essential for reading development, while poor comprehenders do not have difficulties with phonological processing skills. Phonological processing deficits may underlie word decoding difficulties in dyslexia.

Refinement of Semantic-Information for WSD Using Mutual Information (상호정보량을 이용한 동형이의어 분별용 의미정보의 정제)

  • 김준수;이왕우;김창환;옥철영
    • Proceedings of the Korean Information Science Society Conference / 2002.04b / pp.460-463 / 2002
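  • Existing semantic information extracted from dictionary definitions was obtained by extracting every noun and predicate from definitions that contain a homograph, so it includes a considerable amount of information that is unsuitable for word sense disambiguation. This unsuitable information causes misanalysis and over-analysis. A criterion is therefore needed for selecting, from the existing semantic information, only the information that is useful for homograph disambiguation. In this paper, we compute the mutual information between homographs and semantic information in dictionary definitions and set a threshold to restrict which semantic information is selected. Various homograph disambiguation experiments were carried out to test the effectiveness of the semantic information restricted by the threshold.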
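
The threshold-based filtering can be sketched in Python as follows. The sketch assumes pointwise mutual information computed from co-occurrence counts, which may differ from the exact formulation in the paper; the function name `select_by_pmi`, the sense labels, and the toy observations are hypothetical.

```python
import math
from collections import Counter

def select_by_pmi(pairs, threshold):
    """Keep (sense, word) pairs whose pointwise mutual information
    log2(P(s, w) / (P(s) * P(w))) is at least the threshold."""
    total = len(pairs)
    pair_counts = Counter(pairs)
    sense_counts = Counter(s for s, _ in pairs)
    word_counts = Counter(w for _, w in pairs)
    kept = {}
    for (sense, word), c in pair_counts.items():
        # Equivalent to log2((c/total) / ((n_s/total) * (n_w/total))).
        pmi = math.log2(c * total / (sense_counts[sense] * word_counts[word]))
        if pmi >= threshold:
            kept[(sense, word)] = pmi
    return kept

# Toy observations: (homograph sense, word seen in its dictionary definition).
pairs = [
    ("bae_ship", "sea"), ("bae_ship", "sea"), ("bae_ship", "thing"),
    ("bae_pear", "fruit"), ("bae_pear", "fruit"), ("bae_pear", "thing"),
]
kept = select_by_pmi(pairs, threshold=0.5)
```

In this toy data "thing" appears equally often with both senses, so its PMI is 0 and it is filtered out, while "sea" and "fruit" each have a PMI of 1 and are kept as disambiguation cues.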

Categorization of Korean documents using Support Vector Machines (SVM을 이용한 한글문서 범주화 실험)

  • 최성환;임혜영;정영미
    • Proceedings of the Korean Society for Information Management Conference / 2000.08a / pp.29-32 / 2000
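  • Among the learning classifiers used for automatic document categorization, the SVM performs well even without reducing the feature dimensionality. In this experiment, we used two SVM classifiers on the KTSET text collection to compare performance across feature reduction and feature representation settings. For feature reduction, the $\chi^2$ statistic was used as the feature selection criterion; for feature values, various weights composed of two factors, term frequency and document frequency, were used. The results show that the SVM was not greatly affected by feature reduction but did differ in performance depending on the weighting type.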
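
Feature selection with the $\chi^2$ statistic can be sketched in Python as follows. The sketch uses the standard 2x2 contingency form of the statistic for a term and a category; the function names `chi_square` and `select_features` and the toy documents are hypothetical and are not taken from the KTSET experiment.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category table.
    a: in-category docs containing the term
    b: out-of-category docs containing the term
    c: in-category docs without the term
    d: out-of-category docs without the term"""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def select_features(docs, category, top_k):
    """Rank terms by chi-square against one category and keep the top_k."""
    vocab = set().union(*(terms for terms, _ in docs))
    scores = {}
    for term in vocab:
        a = sum(1 for terms, label in docs if term in terms and label == category)
        b = sum(1 for terms, label in docs if term in terms and label != category)
        c = sum(1 for terms, label in docs if term not in terms and label == category)
        d = sum(1 for terms, label in docs if term not in terms and label != category)
        scores[term] = chi_square(a, b, c, d)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

docs = [
    ({"stock", "market"}, "econ"),
    ({"stock", "bank"}, "econ"),
    ({"game", "market"}, "sport"),
    ({"game", "team"}, "sport"),
]
selected = select_features(docs, "econ", top_k=2)
```

Note that "game" ranks as high as "stock" for the "econ" category: the statistic measures dependence in either direction, so a term whose presence reliably signals that a document is outside the category is as informative as one that signals membership.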
