• Title/Summary/Keyword: Korean corpus

Extracting Multiword Sentiment Expressions by Using a Domain-Specific Corpus and a Seed Lexicon

  • Lee, Kong-Joo;Kim, Jee-Eun;Yun, Bo-Hyun
    • ETRI Journal / v.35 no.5 / pp.838-848 / 2013
  • This paper presents a novel approach to automatically generate Korean multiword sentiment expressions by using a seed sentiment lexicon and a large-scale domain-specific corpus. A multiword sentiment expression consists of a seed sentiment word and its contextual words occurring adjacent to the seed word. The multiword sentiment expressions that are the focus of our study have a different polarity from that of the seed sentiment word. The automatically extracted multiword sentiment expressions show that 1) the contextual words should be defined as a part of a multiword sentiment expression in addition to their corresponding seed sentiment word, 2) the identified multiword sentiment expressions contain various indicators for polarity shift that have rarely been recognized before, and 3) the newly recognized shifters contribute to assigning a more accurate polarity value. The empirical result shows that the proposed approach achieves improved performance of the sentiment analysis system that uses an automatically generated lexicon.

Language Model Adaptation for Broadcast News Recognition (방송 뉴스 인식을 위한 언어 모델 적응)

  • Kim Hyun Suk;Jeon Hyung Bae;Kim Sanghun;Choi Joon Ki;Yun Seung
    • MALSORI / no.51 / pp.99-115 / 2004
  • In this paper, we propose language model (LM) adaptation for broadcast news recognition. We collect recent articles from the internet in real time, build a small LM from them, and interpolate this recent LM with an existing LM trained on a large broadcast news corpus. Because the collected recent corpus contains articles both related and unrelated to the test set, we ran interpolation experiments to determine which articles are best to use. The best result came from an adapted LM that combined the existing LM with a recent LM built from articles selected by the Tf-Idf method as similar to the test set: pseudo-morpheme-based recognition showed a 17.2% improvement in ERR, and the number of OOV words dropped from 70 to 27.
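
The interpolation step in this abstract can be illustrated with a minimal sketch. It assumes unigram models, add-alpha smoothing, a fixed weight `lam`, and toy sentences; none of these choices come from the paper, which does not give its model order or weights in the abstract.

```python
from collections import Counter

def unigram_lm(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def interpolate(p_recent, p_existing, lam):
    """Linear interpolation: lam * recent + (1 - lam) * existing."""
    return {w: lam * p_recent[w] + (1.0 - lam) * p_existing[w] for w in p_existing}

# Toy corpora (hypothetical): a large "existing" corpus and a small "recent" one.
existing_tokens = "the news anchor reported the weather and the news".split()
recent_tokens = "the election results dominated the news".split()
vocab = sorted(set(existing_tokens) | set(recent_tokens))

p_existing = unigram_lm(existing_tokens, vocab)
p_recent = unigram_lm(recent_tokens, vocab)

# lam = 0.3 is an arbitrary illustrative weight, not a value from the paper.
p_adapted = interpolate(p_recent, p_existing, lam=0.3)
```

Words that appear only in the recent articles, such as `election` here, receive more probability in `p_adapted` than in `p_existing`, which is how this kind of adaptation can reduce out-of-vocabulary and recognition errors on current news.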

English Conditional Inversion: A Construction-Based Approach

  • Kim, Jong-Bok
    • Language and Information / v.15 no.1 / pp.13-29 / 2011
  • Conditional sentences can also be formed by inversion of the subject and auxiliary, but only in limited environments. Based on corpus investigations, this paper addresses the grammatical constraints on conditional inversion and how inverted conditionals behave differently from regular conditional clauses. Our corpus search reveals many different types of conditional inversion constructions, indicating the difficulty of deriving inverted conditionals from movement operations. In this paper, we provide a construction-based approach to the inverted conditional construction. The paper shows that the best way of describing the general as well as idiosyncratic properties of inverted conditional constructions is an account in the spirit of construction grammar, in which a grammar is a repertory of constructions forming a network connected by links of inheritance.

N-gram Adaptation Using Information Retrieval and Dynamic Interpolation Coefficient (정보검색 기법과 동적 보간 계수를 이용한 N-gram 언어모델의 적응)

  • Choi Joon Ki;Oh Yung-Hwan
    • MALSORI / no.56 / pp.207-223 / 2005
  • The goal of language model adaptation is to improve the background language model with a relatively small adaptation corpus. This study presents a language model adaptation technique for the case where no additional text data for adaptation exist. We propose an information retrieval (IR) technique with N-gram language modeling to collect the adaptation corpus from the baseline text data. We also propose a dynamic interpolation coefficient to combine the background language model and the adapted language model. The interpolation coefficient is estimated from the word hypotheses obtained by segmenting the input speech data reserved for held-out validation data. This allows the final adapted model to improve on the performance of the background model consistently. The proposed approach reduces the word error rate by $13.6\%$ relative to the baseline 4-gram for two-hour broadcast news speech recognition.
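
The abstract does not spell out how the interpolation coefficient is estimated. As a stand-in, the sketch below uses the standard EM update for a two-component mixture weight on held-out words; the toy probabilities, the word list, and the function names are all hypothetical.

```python
import math

def mixture_log_likelihood(lam, p_bg, p_ad, words):
    """Log-likelihood of words under lam * adapted + (1 - lam) * background."""
    return sum(math.log(lam * p_ad[w] + (1.0 - lam) * p_bg[w]) for w in words)

def estimate_lambda(p_bg, p_ad, words, lam=0.5, iters=20):
    """EM estimate of the interpolation coefficient on held-out words."""
    for _ in range(iters):
        # E-step: posterior that each word was generated by the adapted model.
        posts = [lam * p_ad[w] / (lam * p_ad[w] + (1.0 - lam) * p_bg[w])
                 for w in words]
        # M-step: the new coefficient is the average posterior.
        lam = sum(posts) / len(posts)
    return lam

# Toy unigram models and held-out words (illustrative values only).
p_background = {"a": 0.7, "b": 0.2, "c": 0.1}
p_adapted = {"a": 0.2, "b": 0.3, "c": 0.5}
held_out = ["c", "b", "a", "a", "c"]

lam_hat = estimate_lambda(p_background, p_adapted, held_out)
```

Re-estimating the coefficient per input segment, rather than fixing it once, is what would make it "dynamic" in the sense of the abstract; here `held_out` plays the role of the word hypotheses for one segment.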

Input Dimension Reduction based on Continuous Word Vector for Deep Neural Network Language Model (Deep Neural Network 언어모델을 위한 Continuous Word Vector 기반의 입력 차원 감소)

  • Kim, Kwang-Ho;Lee, Donghyun;Lim, Minkyu;Kim, Ji-Hwan
    • Phonetics and Speech Sciences / v.7 no.4 / pp.3-8 / 2015
  • In this paper, we investigate an input dimension reduction method that uses continuous word vectors in a deep neural network language model. In the proposed method, continuous word vectors were generated from a large training corpus using Google's Word2Vec so as to satisfy the distributional hypothesis. The discrete 1-of-$|V|$ coded word vectors were replaced with their corresponding continuous word vectors. In our implementation, the input dimension was reduced from 20,000 to 600 when a tri-gram language model was used with a vocabulary of 20,000 words. The total training time was reduced from 30 days to 14 days for the Wall Street Journal training corpus (corpus length: 37M words).
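
A small sketch may help show where the reduction comes from. It assumes a per-word vector size of 300, chosen only because two tri-gram context words would then give the 600 inputs mentioned in the abstract, and it replaces the trained Word2Vec table with a hypothetical deterministic stand-in.

```python
import random

V = 20_000  # vocabulary size, as stated in the abstract
D = 300     # per-word vector size: an assumption, not stated in the abstract

def word_vector(word_id, dim=D):
    """Hypothetical stand-in for a trained Word2Vec lookup (pseudo-random values)."""
    rng = random.Random(word_id)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def one_hot(word_id, vocab_size=V):
    """Discrete 1-of-|V| coding: a single 1.0 in a vector of length |V|."""
    vec = [0.0] * vocab_size
    vec[word_id] = 1.0
    return vec

def continuous_input(context_ids):
    """Concatenate the continuous vectors of the context words."""
    out = []
    for word_id in context_ids:
        out.extend(word_vector(word_id))
    return out

context = [17, 4242]  # hypothetical ids of the two preceding words in a tri-gram
x_discrete = one_hot(context[0])        # 20,000 inputs for a single word
x_continuous = continuous_input(context)  # 600 inputs for both context words
```

The network's first layer then sees a short dense vector instead of a long sparse one, which is the likely source of the training-time savings the abstract reports.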

A Study on the Voice Onset Times of the Buckeye Corpus Stops (벅아이 코퍼스 파열음의 성대진동 개시시간 연구)

  • Park, Soo Hee;Yoon, Kyuchul
    • Phonetics and Speech Sciences / v.8 no.1 / pp.9-17 / 2016
  • The purpose of this work is to examine the voice onset times (VOTs) of the voiceless and voiced stops produced by the ten young male speakers of the Buckeye corpus [9]. Factors known to affect VOTs were also extracted, including the place of articulation, the height of the following vowel, the location within the word, the presence of a preceding [s], whether the target word is a content or function word, the presence of syllabic stress, word frequency, and speech rate. The findings mostly agreed with those of earlier studies on English, with some exceptions and new discoveries. We hope that this work can contribute to understanding the nature and properties of spontaneous English speech.

An Analysis of the Vowel Formants of the Young versus Old Speakers in the Buckeye Corpus (벅아이 코퍼스에서의 연령별 모음 포먼트 분석)

  • Kim, Ji-Eun;Yoon, Kyuchul
    • Phonetics and Speech Sciences / v.4 no.4 / pp.29-35 / 2012
  • The purpose of this study was to measure the first two vowel formants of forty speakers from the Buckeye Corpus of Conversational Speech (twenty male and twenty female, each group comprising young and old speakers) and to examine the vowel formant changes across two generations (younger vs. older). The results indicated that the vowel space of the younger generation (in their thirties or less) shifted to the lower left compared to that of the older generation (in their forties or more) for both male and female speakers. A comparison with the results of Peterson & Barney (1952) suggests that the size of the vowel space has changed over time.

Transient global amnesia associated with multiple lesions in the corpus callosum and hippocampus

  • Kim, Jin-Ah;Min, Young Gi;Koo, Dae Lim
    • Annals of Clinical Neurophysiology / v.21 no.2 / pp.102-104 / 2019
  • Transient global amnesia is a syndrome of temporary loss of short-term memory and is not accompanied by any other neurological deficit. Diffusion-weighted imaging is useful to improve the diagnostic accuracy of transient global amnesia. We report a 68-year-old woman with multiple lesions on diffusion-weighted imaging in the right corpus callosum and left hippocampus. To the best of our knowledge, this is the first case of a diffusion-weighted imaging lesion in the body portion of the corpus callosum.

AP, IP Prediction For Corpus-based Korean Text-To-Speech (코퍼스 방식 음성합성에서의 개선된 운율구 경계 예측)

  • Kwon, O-Hil;Hong, Mun-Ki;Kang, Sun-Mee;Shin, Ji-Young
    • Speech Sciences / v.9 no.3 / pp.25-34 / 2002
  • One of the most important factors in the performance of a Korean text-to-speech system is the prediction of accentual phrase (AP) and intonational phrase (IP) boundaries. Previous prediction methods reach only 75-85%, which is inadequate for practical and commercial systems, so a more accurate prediction method is needed. In this study, we propose a simple and more accurate method for predicting AP and IP boundaries.

An Analysis on Korean Intonation Patterns Using Momel (Momel을 이용한 한국어의 억양 패턴 분석)

  • Kim, Sun-Hee;Yoo, Hyun-Ji
    • Proceedings of the KSPS conference / 2007.05a / pp.243-246 / 2007
  • This paper proposes an intonation labeling method using Momel and presents the results of applying it to a speech corpus of 80 passages read by 4 speakers (2 male and 2 female). The results show that Momel works well enough to derive meaningful pitch targets, which can be labeled with H and L tones. In addition, the results of the analysis of the Korean speech corpus are consistent with earlier work.
