• Title/Abstract/Keywords: Korean corpus

1,197 results (processing time: 0.027 s)

사전과 말뭉치를 이용한 한국어 단어 중의성 해소 (Korean Word Sense Disambiguation using Dictionary and Corpus)

  • 정한조;박병화
    • 지능정보연구 / Vol. 21, No. 1 / pp. 1-13 / 2015
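  • As the fields of big data and opinion mining have emerged, information retrieval and extraction technologies, particularly for unstructured data, have become increasingly important. In information retrieval, various studies are also under way to improve search engines so that they return results matching the user's intent. In information retrieval and extraction, natural language processing is a key technology for analyzing and processing unstructured data, and word sense ambiguity, in which a single word can carry several meanings, is one of the first problems that must be solved to improve natural language processing performance. This study introduces a method for automatically generating a corpus usable for word sense disambiguation from dictionary examples, instead of relying on manual methods that demand much time and effort. Specifically, it introduces a word sense disambiguation method that uses a combined corpus, built by automatically tagging the examples of the Standard Korean Language Dictionary (표준국어대사전) and merging them with the Sejong Corpus, which was sense-tagged manually. Nouns are the main target of word sense disambiguation; of all 265,655 nouns in the dictionary, 29,868 are ambiguous words, and the 56,914 proverbs and example sentences associated with their 93,522 senses were added to the combined corpus. Experiments were run with cross-validation on about 790,000 sentences from the Sejong Corpus, tagged with both part of speech and sense, and about 57,000 sentences from the Standard Korean Language Dictionary, used separately or merged. Precision and recall both improved when the combined corpus was used. The results of this study are expected to help improve the quality of results from internet search engines and to help identify the content of sentences more clearly in natural language analysis and processing for opinion mining and text mining.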
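
The combined-corpus idea in this abstract can be illustrated with a minimal sketch. It assumes a toy overlap-count classifier, which is not the authors' method, and every name and data item here (`train`, `disambiguate`, `manual_corpus`, `dictionary_examples`, the English word "bank") is hypothetical. It only shows how merging dictionary examples into a manually tagged corpus can add evidence for a sense the manual corpus lacks.

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Count the context words seen with each sense of each target word."""
    model = defaultdict(lambda: defaultdict(Counter))
    for target, sense, context in tagged_sentences:
        model[target][sense].update(context)
    return model

def disambiguate(model, target, context):
    """Return the sense whose training contexts overlap most with `context`."""
    senses = model.get(target)
    if not senses:
        return None  # target word never seen in training
    # sorted() makes tie-breaking deterministic (first sense in sorted order)
    return max(sorted(senses), key=lambda s: sum(senses[s][w] for w in context))

# Hypothetical toy data: (target word, sense label, context words)
manual_corpus = [("bank", "bank/finance", ["money", "loan", "account"])]
dictionary_examples = [("bank", "bank/river", ["river", "water", "shore"])]

manual_only = train(manual_corpus)
combined = train(manual_corpus + dictionary_examples)

query = ["river", "shore"]
sense_manual = disambiguate(manual_only, "bank", query)
sense_combined = disambiguate(combined, "bank", query)
```

With the manual corpus alone the only known sense is returned regardless of context; with the combined corpus the dictionary example supplies the missing sense.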

문형 사전을 위한 문형 빈도 조사 (Studying the frequencies of sentence pattern for a sentence patterns dictionary)

  • 김유미
    • 인지과학 / Vol. 16, No. 2 / pp. 123-140 / 2005
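  • This paper investigates the occurrence frequency and usage frequency of sentence patterns in order to design an automatic sentence pattern checker based on an electronic sentence pattern dictionary for Korean language education. First, the concept of a sentence pattern in Korean language education is defined, and sentence patterns are classified into syntactic patterns (구문 문형) and expression patterns (표현 문형). The paper then analyzes how syntactic patterns, which center on predicates, and expression patterns, which center on dependent nouns, endings, and particles, appear in a learner corpus. The learner corpus was built in two parts: a standard corpus, representing what learners must learn, and an error corpus, consisting of the learners' own output. Occurrence frequency was measured in the standard corpus, which is composed of Korean textbooks, and usage frequency was measured in the error corpus, a collection of texts written by learners. The sentence patterns are recorded in the electronic dictionary in order of learners' usage frequency, which should optimize sentence pattern lookup speed.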

A Corpus-based Analysis of EFL Learners' Use of Discourse Markers in Cross-cultural Communication

  • Min, Sujung
    • 영어어문교육 / Vol. 17, No. 3 / pp. 177-194 / 2011
  • This study examines the use of discourse markers in cross-cultural communication between EFL learners in an e-learning environment. The study analyzes the use of discourse markers in a corpus of an interactive web with a bulletin board system through which college students of English at Japanese and Korean universities interacted with each other discussing the topics of local and global issues. It compares the use of discourse markers in the learners' corpus to that of a native English speakers' corpus. The results indicate that discourse markers are useful interactional devices to structure and organize discourse. EFL learners are found to display more frequent use of referentially and cognitively functional discourse markers and a relatively rare use of other markers. Native speakers are found to use a wider variety of discourse markers for different functions. Suggestions are made for using computer corpora in understanding EFL learners' language difficulties and helping them become more interactionally competent speakers.

코퍼스 기반 프랑스어 텍스트 정규화 평가 (Corpus-based evaluation of French text normalization)

  • 김선희
    • 말소리와 음성과학 / Vol. 10, No. 3 / pp. 31-39 / 2018
  • This paper aims to present a taxonomy of non-standard words (NSW) for developing a French text normalization system and to propose a method for evaluating this system based on a corpus. The proposed taxonomy of French NSWs consists of 13 categories, including 2 types of letter-based categories and 9 types of number-based categories. In order to evaluate the text normalization system, a representative test set including NSWs from various text domains, such as news, literature, non-fiction, social-networking services (SNSs), and transcriptions, is constructed, and an evaluation equation is proposed reflecting the distribution of the NSW categories of the target domain to which the system is applied. The error rate of the test set is 1.64%, while the error rate of the whole corpus is 2.08%, reflecting the NSW distribution in the corpus. The results show that the literature and SNS domains are assessed as having higher error rates compared to the test set.
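
The abstract does not give the evaluation equation itself. As a hedged illustration of what "reflecting the distribution of the NSW categories of the target domain" could mean, the sketch below assumes a simple distribution-weighted average of per-category error rates. The function name, category names, and numbers are all hypothetical and are not taken from the paper.

```python
def weighted_error_rate(category_counts, category_error):
    """Average per-category error rates, weighted by how often each
    NSW category occurs in the target domain.

    category_counts: {category: number of NSWs of that category in the domain}
    category_error:  {category: error rate measured for that category}
    """
    total = sum(category_counts.values())
    if total == 0:
        raise ValueError("target domain contains no NSWs")
    return sum(
        count / total * category_error[cat]  # KeyError if a category is unmeasured
        for cat, count in category_counts.items()
    )

# Hypothetical numbers: a domain with 30 abbreviations and 70 cardinal numbers
domain_counts = {"abbreviation": 30, "cardinal": 70}
measured_error = {"abbreviation": 0.01, "cardinal": 0.03}
domain_error = weighted_error_rate(domain_counts, measured_error)
```

Under this assumption, two domains evaluated with the same system can receive different error rates purely because their category mixes differ, which is consistent with the test-set and whole-corpus figures being reported separately.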

벅아이 코퍼스 오류 수정과 코퍼스 활용을 위한 프랏 스크립트 툴 (Error Correction and Praat Script Tools for the Buckeye Corpus of Conversational Speech)

  • 윤규철
    • 말소리와 음성과학 / Vol. 4, No. 1 / pp. 29-47 / 2012
  • The purpose of this paper is to show how to convert the label files of the Buckeye Corpus of Conversational Speech [1] into Praat format and to introduce some of the Praat scripts that will enable linguists to study various aspects of spoken American English present in the corpus. During the conversion process, several types of errors were identified and corrected, either manually or automatically with scripts. The Praat script tools that have been developed can help extract large numbers of phonetic measures from the corpus, such as the VOT of plosives, the formants of vowels, word frequency information, and speech rates that span several consecutive words. The script tools can also extract additional information concerning the phonetic environment of the target words or allophones.

두뇌 자기공명영상에서의 corpus callosum의 자동인식 알고리즘 (Algorithm for automatic recognition of corpus callosum from sagittal brain MR images)

  • 허신;이철희
    • 대한의용생체공학회:학술대회논문집 / 대한의용생체공학회 1998 Fall Conference / pp. 62-63 / 1998
  • In this paper, a new method to find the corpus callosum from sagittal brain MR images is proposed, which uses the statistical characteristics and shape information of corpus callosum. First, we extract regions satisfying the statistical characteristics of the corpus callosum and then find a region matching the shape information. In order to match the shape information, a new directed window region growing algorithm is proposed instead of using conventional contour matching algorithms. Using the proposed algorithm, we adaptively relax the statistical requirement until we find a region matching the shape information. Experiments show very promising results.
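
The abstract does not specify the directed window region growing algorithm, so the sketch below substitutes plain 4-connected region growing to illustrate only the adaptive step: widening the intensity window until the grown region passes a shape test. The function names, the toy image, and the area-only `shape_ok` check are hypothetical stand-ins, not the authors' statistical or shape criteria.

```python
from collections import deque

def grow_region(image, seed, low, high):
    """Return the 4-connected set of pixels reachable from `seed`
    whose intensity lies in [low, high]."""
    rows, cols = len(image), len(image[0])
    r0, c0 = seed
    if not (low <= image[r0][c0] <= high):
        return set()
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in region
                    and low <= image[nr][nc] <= high):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region

def find_region(image, seed, target, shape_ok, max_tolerance=50, step=5):
    """Relax the intensity window around `target` step by step until
    the grown region satisfies `shape_ok`; return (region, tolerance)."""
    for tol in range(0, max_tolerance + 1, step):
        region = grow_region(image, seed, target - tol, target + tol)
        if region and shape_ok(region):
            return region, tol
    return None, None

# Toy image: a bright horizontal bar on a dark background
image = [
    [10, 10, 10, 10, 10],
    [10, 200, 195, 190, 10],
    [10, 10, 10, 10, 10],
]
# Area-only proxy for a shape test (an assumption for illustration)
region, tolerance = find_region(image, (1, 1), 200, lambda reg: len(reg) >= 3)
```

Here the window must be relaxed twice before the whole bar is captured; a real shape test would compare the region against a model of the structure rather than its area alone.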

Prostaglandin $F_2{\alpha}$ Controls Reactive Oxygen Species in Bovine Corpus Luteum

  • Lee, Seunghyung;Yang, Boo-Keun;Park, Choon-Keun
    • Reproductive and Developmental Biology / Vol. 39, No. 1 / pp. 1-6 / 2015
  • Luteolysis is the cyclical regression of the corpus luteum in many non-primate mammalian species. Prostaglandin $F_2{\alpha}$ ($PGF_2{\alpha}$) from the uterus and ovary induces functional and structural luteolysis in cattle. The action of $PGF_2{\alpha}$ is mediated by the $PGF_2{\alpha}$ receptor located on the luteal steroidogenic and endothelial cell membranes. $PGF_2{\alpha}$ plays an important role in regulating nitric oxide production in endothelial cells of the bovine corpus luteum. Nitric oxide production and nitric oxide synthase activity are stimulated and induced by $PGF_2{\alpha}$ in luteal endothelial cells. Moreover, reactive oxygen species inhibit progesterone secretion in bovine luteal cells and induce apoptosis. Thus, the interaction between $PGF_2{\alpha}$ and reactive oxygen species is an important aspect of corpus luteum physiology in functional and structural luteolysis.

벅아이 코퍼스를 이용한 영어 무성파열음의 VOT 연구 (A Study on the Voice Onset Time of English Voiceless Stops in the Buckeye Corpus)

  • 윤규철
    • 말소리와 음성과학 / Vol. 4, No. 2 / pp. 33-40 / 2012
  • The purpose of this paper is to investigate the voice onset time (VOT) of the English voiceless stops [p, t, k] found in the Buckeye Corpus of Conversational Speech [1]. Three young female speakers were chosen for this study and their VOT values were semi-automatically extracted along with other factors. The factors used for the analysis were place of articulation, location in word, syllabic stress, content word or not, word frequency calculated from the corpus, and the speech rate expressed in syllables per second. Results showed that, for the three places of articulation of each speaker, all the factors had a statistically significant effect on the VOT values. This paper has significance in that the materials used for the analysis were from a corpus of spontaneous natural English speech.

벅아이 코퍼스에서의 젊은 성인 남성의 모음 포먼트 분석 (An Analysis of the Vowel Formants of the Young Males in the Buckeye Corpus)

  • 윤규철;노혜욱
    • 말소리와 음성과학 / Vol. 4, No. 2 / pp. 41-49 / 2012
  • The purpose of this paper is to extract the vowel formants of the ten young male speakers from the Buckeye Corpus of Conversational Speech [1] and to analyze them in comparison to earlier works in terms of various phonetic factors that are expected to affect the realization of the formant distribution. The first two formant frequency values were automatically extracted with a Praat script along with such factors as the place of articulation, the content versus function word information, the syllabic stress information, the location in a word, the location in an utterance, the speech rate of three consecutive words, and the word frequency in the corpus. The results indicated that the formant patterns from the corpus were very different from those of earlier works, although the overall pattern was similar, and that the factors were strongly responsible for the realization of the two formants.

Reduction and Frequency Analyses of Vowels and Consonants in the Buckeye Speech Corpus

  • Yang, Byung-Gon
    • 말소리와 음성과학 / Vol. 4, No. 3 / pp. 75-83 / 2012
  • This study had three aims: first, to examine how far the actual speech of American English speakers deviates from dictionary-prescribed symbols; second, to measure the frequency of the vowels and consonants produced by American English speakers; and third, to investigate gender differences in the segmental sounds of a speech corpus. The Buckeye Speech Corpus was recorded by forty American male and female subjects, for one hour per subject. The vowels and consonants in both the phonemic and phonetic transcriptions were extracted from the original files of the corpus, and their frequencies were obtained using scripts written in R, a free software environment. The results were as follows. First, the American English speakers produced a reduced number of vowels and consonants in daily conversation; the reduction rate from the dictionary transcriptions to the actual transcriptions was around 38.2%. Second, the American English speakers used more front high and back low vowels, while stops, fricatives, and nasals accounted for three-fourths of the consonants. This indicates that the segmental inventory has a nonlinear frequency distribution in the speech corpus. Third, the two gender groups produced vowels and consonants similarly, even though there were a few noticeable differences in their speech. From these results we propose that English teachers consider pronunciation education that reflects actual speech sounds and that linguists find a way to establish unmarked segmentals from speech corpora.