• Title/Summary/Keyword: Corpus-based Analysis


Generating a Korean Sentiment Lexicon Through Sentiment Score Propagation (감정점수의 전파를 통한 한국어 감정사전 생성)

  • Park, Ho-Min;Kim, Chang-Hyun;Kim, Jae-Hoon
    • KIPS Transactions on Software and Data Engineering / v.9 no.2 / pp.53-60 / 2020
  • Sentiment analysis is the automated process of understanding attitudes and opinions about a given topic from written or spoken text. One approach to sentiment analysis is dictionary-based, in which a sentiment dictionary plays an important role. In this paper, we propose a method to automatically generate a Korean sentiment lexicon from the well-known English sentiment lexicon VADER (Valence Aware Dictionary and sEntiment Reasoner). The proposed method consists of three steps. The first step is to build a Korean-English bilingual lexicon using a Korean-English parallel corpus. The bilingual lexicon is a set of pairs of VADER sentiment words and Korean morphemes, the latter serving as candidate Korean sentiment words. The second step is to construct a bilingual word graph from the bilingual lexicon. The third step is to run the label propagation algorithm over the bilingual graph. Finally, a new Korean sentiment lexicon is generated by repeatedly applying the propagation algorithm until the values of all vertices converge. Empirically, the dictionary-based sentiment classifier using the Korean sentiment lexicon outperforms machine learning-based approaches on the KMU sentiment corpus and the Naver sentiment corpus. In the future, we will apply the proposed approach to generate multilingual sentiment lexica.
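
The propagation step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the edge weights, the toy word pairs, the seed scores, the clamping of English seed words, and the weighted-average update rule are all assumptions made for the example.

```python
def propagate_scores(edges, seed_scores, tol=1e-6, max_iter=1000):
    """Propagate sentiment scores over a bilingual word graph.

    edges: dict mapping (english_word, korean_morpheme) -> positive weight
           (assumed here to come from co-occurrence in a parallel corpus).
    seed_scores: dict mapping English seed words to fixed sentiment scores.
    Returns scores for the non-seed (Korean) vertices only.
    """
    # Build an undirected adjacency list from the bilingual pairs.
    neighbors = {}
    for (en, ko), w in edges.items():
        neighbors.setdefault(en, []).append((ko, w))
        neighbors.setdefault(ko, []).append((en, w))

    scores = {node: seed_scores.get(node, 0.0) for node in neighbors}
    for _ in range(max_iter):
        new_scores = {}
        for node, nbrs in neighbors.items():
            if node in seed_scores:
                new_scores[node] = seed_scores[node]  # seeds stay clamped
            else:
                total = sum(w for _, w in nbrs)
                new_scores[node] = sum(scores[n] * w for n, w in nbrs) / total
        # Stop once no vertex value changes by more than the tolerance.
        delta = max((abs(new_scores[n] - scores[n]) for n in scores), default=0.0)
        scores = new_scores
        if delta < tol:
            break
    return {n: s for n, s in scores.items() if n not in seed_scores}


# Hypothetical toy graph; weights and seed scores are illustrative only.
edges = {("good", "좋"): 3.0, ("great", "좋"): 1.0,
         ("good", "괜찮"): 1.0, ("bad", "나쁘"): 2.0}
seeds = {"good": 1.9, "great": 3.1, "bad": -2.5}
lexicon = propagate_scores(edges, seeds)
```

In this toy graph every Korean vertex is connected only to seed words, so the scores settle after one update; in a graph with edges between unlabeled vertices, several iterations would be needed before convergence.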

The Role of Distributional Cues in the Acquisition of Verb Argument Structures

  • Kim, Mee-Sook
    • Language and Information / v.7 no.1 / pp.87-99 / 2003
  • This paper investigates the role of input frequency in the acquisition of verb argument structures, based on distributional information from a corpus of utterances derived from the English CHILDES database (MacWhinney 1993). It has been widely accepted that children successfully learn verb argument structures through innate language mechanisms, such as linking rules that connect verb meanings to their syntactic structures. In contrast, an approach to language acquisition called "statistical language learning" has recently claimed that children could succeed in acquiring syntactic structures in the absence of innate language mechanisms by making use of distributional properties of the input. In this paper, I evaluate the feasibility of statistical learning for acquiring verb argument structures, based on distributional information about locative verbs in parental input. The naturalistic data allow us to investigate to what extent the statistical learning approach can and cannot help children learn the syntax of locative verbs. Based on the results of the English database analysis, I show that parental input contains rich statistical information for learning the syntactic possibilities of locative verbs, despite some limitations of the statistical learning approach.


A Method of Intonation Modeling for Corpus-Based Korean Speech Synthesizer (코퍼스 기반 한국어 합성기의 억양 구현 방안)

  • Kim, Jin-Young;Park, Sang-Eon;Eom, Ki-Wan;Choi, Seung-Ho
    • Speech Sciences / v.7 no.2 / pp.193-208 / 2000
  • This paper describes a multi-step method of intonation modeling for a corpus-based Korean speech synthesizer. We selected 1833 sentences covering various syntactic structures and built a corresponding speech corpus uttered by a female announcer. We detected the pitch using laryngograph signals, manually marked the prosodic boundaries on the recorded speech, and carried out part-of-speech tagging and syntactic analysis on the text. The detected pitch was separated into three frequency bands (low, mid, and high), which correspond to the baseline, the word tone, and the syllable tone, respectively. We predicted them using the CART method and the Viterbi search algorithm with a word-tone dictionary. Of the collected spoken sentences, 1500 were used for training and 333 for testing. In the layer of word tone modeling, we compared two methods. One predicts the word tone corresponding to the mid-frequency components directly; the other predicts it by multiplying the ratio of the word tone to the baseline by the baseline. The former method resulted in a mean error of 12.37 Hz and the latter in a mean error of 12.41 Hz, so the two performed similarly. In the layer of syllable tone modeling, the mean error rate was less than 8.3% relative to the announcer's mean pitch of 193.56 Hz, so its performance was relatively good.
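
The second word-tone method and the syllable-tone error bound in this abstract come down to simple arithmetic, sketched below. The function names and the example ratio and baseline values are hypothetical; only the 8.3% rate and the 193.56 Hz mean pitch are taken from the abstract.

```python
def word_tone_from_ratio(predicted_ratio, predicted_baseline_hz):
    """Recover the word tone (Hz) from a predicted word-tone/baseline ratio."""
    return predicted_ratio * predicted_baseline_hz


def error_rate_to_hz(error_rate, mean_pitch_hz):
    """Convert an error rate relative to the mean pitch into an error in Hz."""
    return error_rate * mean_pitch_hz


# Hypothetical prediction: ratio 1.05 on a 180 Hz baseline.
word_tone_hz = word_tone_from_ratio(1.05, 180.0)

# Reported bound: 8.3% of the announcer's 193.56 Hz mean pitch.
syllable_error_bound_hz = error_rate_to_hz(0.083, 193.56)
```

Under this reading, an error rate below 8.3% corresponds to a mean syllable-tone error of less than roughly 16.07 Hz.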


A Deterministic Method for Structural Analysis of Compound Words in Japanese

  • Han, Dongli;Ito, Takeshi;Furugori, Teiji
    • Proceedings of the Korean Society for Language and Information Conference / 2002.02a / pp.79-91 / 2002
  • Structural analysis of compound words is a necessary and important process in natural language processing. Proposed here is a corpus- and statistics-based method for the structural analysis of compound words in Japanese. We determine the structure of a compound word by using an Internet corpus and calculating the strength of word association among its constituent words. Experiments with compound words of 5, 6, 7, and 8 kanji show that our method works well and that its performance is better than that of other comparable studies.
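
One way to turn association strengths into a compound structure is sketched below. The abstract does not specify the association measure or the bracketing procedure, so the weakest-link splitting rule, the function name, and the scores here are assumptions chosen for illustration, not the authors' method.

```python
def bracket(words, assoc):
    """Return a nested-tuple binary structure for a list of constituent words.

    assoc maps (left_word, right_word) for adjacent constituents to an
    association strength; missing pairs are treated as 0.0.
    """
    if len(words) == 1:
        return words[0]
    # Split at the boundary whose adjacent words are least strongly associated.
    split = min(range(1, len(words)),
                key=lambda i: assoc.get((words[i - 1], words[i]), 0.0))
    return (bracket(words[:split], assoc), bracket(words[split:], assoc))


# Hypothetical scores for 自然言語処理 ("natural language processing").
assoc = {("自然", "言語"): 5.2, ("言語", "処理"): 3.1}
structure = bracket(["自然", "言語", "処理"], assoc)
```

With these made-up scores the weakest link is between 言語 and 処理, so the sketch groups 自然 and 言語 first, giving a left-branching structure.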


An Analysis of the Intellectual Structure of Venture-Creation Studies to build an Entrepreneurship Ontology (창업 온톨로지 구축을 위한 벤처창업 연구의 지식구조 분석)

  • Sim, Jae-Hu;Choi, Myeonggil
    • Knowledge Management Research / v.14 no.4 / pp.75-86 / 2013
  • Interest in and research on entrepreneurship, which is considered a potential remedy for the continuing economic recession of the 21st century, have grown. However, the processes and methodologies of this research have not been systematically organized, and its results have rarely been applied to raising the success rate of new businesses. This study adopts a corpus methodology to analyze the knowledge structure of entrepreneurship research and to derive the essential concepts and constituent domains of venture research. Based on the results of the analysis, this study constructs the knowledge structure of venture research in the form of a knowledge ontology. The results could serve as a foundation for entrepreneurship research and inform the construction of an entrepreneurship knowledge ontology.


Corpus Annotation for the Linguistic Analysis of Reference Relations between Event and Spatial Expressions in Text (텍스트 내 사건-공간 표현 간 참조 관계 분석을 위한 말뭉치 주석)

  • Chung, Jin-Woo;Lee, Hee-Jin;Park, Jong C.
    • Language and Information / v.18 no.2 / pp.141-168 / 2014
  • Recognizing spatial information associated with events expressed in natural language text is essential not only for the interpretation of such events but also for the understanding of the relations among them. However, spatial information is mentioned far less often than events, and the association between event and spatial expressions is often left implicit in a text. This makes it difficult to automate the extraction of spatial information associated with events from the text. In this paper, we give a linguistic analysis of how spatial expressions are associated with event expressions in a text. We first present issues in annotating narrative texts with reference relations between event and spatial expressions, and then discuss surface-level linguistic characteristics of such relations based on the annotated corpus, to provide insight for developing an automated recognition method.


A Study to Rethink the Components of Teaching Korean Genitive Particle '의': Based on the Errors in Korean Learners' Corpus (한국어 학습자 대상 관형격 조사 '의'의 교육 내용 재고: 학습자 말뭉치에 나타난 오류를 바탕으로)

  • Soo-Hyun Lee;Ji-Young Sim
    • Journal of the Korean Society of Industry Convergence / v.26 no.3 / pp.443-454 / 2023
  • The purpose of this study is to reveal how Korean learners use the genitive particle '의' across its semantic classes, so that the findings can inform decisions about the content and methods of related instruction. The study adopts a quantitative analysis of the learner corpus established by the National Institute of Korean Language. The analysis shows that as proficiency increases, the overall frequency of '의' increases and the number of senses used increases. However, the frequency of errors also increases. As for the usage pattern of each sense, 'ownership, belonging' is the most frequent, followed by 'acting entity', 'kinship, social relations', and 'relationship(area)'. In conclusion, the senses 'acting entity' and 'relationship(area)' need to be supplemented with explicit instruction. The other senses need further discussion, and decisions should take learning purpose and proficiency into account.

Digital Application of Intangible Cultural Heritage from the Perspective of Cultural Ecology

  • Jing, Xiuli;Tan, Fang;Zhang, Mu
    • Journal of Smart Tourism / v.1 no.1 / pp.41-52 / 2021
  • This paper explored the digital application of intangible cultural heritage from the perspective of cultural ecology. Through field investigations combined with cultural ecology theory, an ontology-based semantic web approach was proposed, and the Nanjing "Yunjin" brocade weaving technique was selected as the research object. The specific steps were as follows. First, based on the field surveys and cultural ecology theory, the intangible cultural ecological environment was divided into natural and social environments. Next, the intangible cultural heritage ontology was constructed: a knowledge corpus on the Nanjing Yunjin weaving technique was collected and collated, and, based on user needs analysis and corpus analysis, CIDOC CRM was used to create the rules for building the ontology. Finally, based on the MediaWiki platform and Semantic MediaWiki, the semantic web model of the intangible cultural heritage was designed and its semantic retrieval function was implemented, thereby achieving a practical application of intangible cultural heritage digitization. From the perspective of cultural ecology, a set of digital application models for intangible cultural heritage was proposed, which expanded the digital application of cultural ecology theory, verified the applicability of the model to the sustainable development of cultural tourism, and provided a reference for that development.

The f0 distribution of Korean speakers in a spontaneous speech corpus

  • Yang, Byunggon
    • Phonetics and Speech Sciences / v.13 no.3 / pp.31-37 / 2021
  • The fundamental frequency, or f0, is an important acoustic measure in the prosody of human speech. The current study examined the f0 distribution of a corpus of spontaneous speech in order to provide normative data for Korean speakers. The corpus consists of 40 speakers talking freely about their daily activities and their personal views. Praat scripts were created to collect f0 values, and a majority of obvious errors were corrected manually by watching and listening to the f0 contour on a narrow-band spectrogram. Statistical analyses of the f0 distribution were conducted using R. The results showed that the f0 values of all the Korean speakers were right-skewed, with a pointy distribution. The speakers produced spontaneous speech within a frequency range of 274 Hz (from 65 Hz to 339 Hz), excluding statistical outliers. The mode of the total f0 data was 102 Hz. The female f0 range, with a bimodal distribution, appeared wider than that of the male group. Regression analyses based on age and f0 values yielded negligible R-squared values. As the mode of an individual speaker could be predicted from the median, either the median or the mode could serve as a good reference for the individual f0 range. Finally, an analysis of the continuous f0 points of intonational phrases revealed that the initial and final segments of the phrases yielded several f0 measurement errors. From these results, we conclude that an examination of a spontaneous speech corpus can provide linguists with useful measures to generalize the acoustic properties of f0 variability in a language for an individual or for groups. Further studies on the use of statistical measures to secure reliable f0 values for individual speakers would be desirable.
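
The descriptive measures named in this abstract (range, median, mode, skewness) can be computed as in the sketch below. The study itself used Praat and R; this Python version, the function name, the rounding of f0 to whole Hz before taking the mode, and the sample values are all assumptions for illustration and do not reproduce the study's data.

```python
import statistics


def f0_summary(f0_hz):
    """Summarize a list of f0 measurements given in Hz."""
    values = sorted(f0_hz)
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    # Moment-based skewness: positive means a longer tail toward high f0.
    skew = sum((v - mean) ** 3 for v in values) / (n * sd ** 3)
    return {
        "min": values[0],
        "max": values[-1],
        "range": values[-1] - values[0],
        "median": statistics.median(values),
        # Mode taken over f0 values rounded to whole Hz (an assumed binning).
        "mode": statistics.mode([round(v) for v in values]),
        "skewness": skew,
    }


# Made-up f0 values for a single hypothetical speaker.
summary = f0_summary([98, 102, 102, 105, 110, 102, 130, 180, 240])
```

A positive skewness, as in this toy sample, corresponds to the right-skewed shape the abstract reports.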

Long Short Term Memory based Political Polarity Analysis in Cyber Public Sphere

  • Kang, Hyeon;Kang, Dae-Ki
    • International Journal of Advanced Culture Technology / v.5 no.4 / pp.57-62 / 2017
  • In this paper, we apply long short-term memory (LSTM) to classify political polarity in the cyber public sphere. The data collected from the cyber public sphere are transformed into word corpus data through word embedding. Based on these word corpus data, we train a recurrent neural network (RNN) built from LSTM units. A softmax function is applied to the output of the RNN. We ran experiments with the proposed system to obtain results, and we plan to enhance the system by refining its LSTM component.
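
The softmax output stage mentioned in this abstract turns the network's raw class scores into probabilities. The sketch below shows only that step; the scores and the three-class setup are hypothetical, and the embedding and LSTM layers that would produce the scores are omitted.

```python
import math


def softmax(logits):
    """Map raw class scores to probabilities that sum to 1."""
    # Subtract the maximum score first for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output scores for three polarity classes.
probs = softmax([2.0, 0.5, -1.0])
predicted_class = probs.index(max(probs))
```

The class with the largest raw score always receives the largest probability, so the predicted label can be read off either from the scores or from the probabilities.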