• Title/Summary/Keyword: word-form

Sentiment Analysis of Korean Reviews Using CNN: Focusing on Morpheme Embedding (CNN을 적용한 한국어 상품평 감성분석: 형태소 임베딩을 중심으로)

  • Park, Hyun-jung;Song, Min-chae;Shin, Kyung-shik
    • Journal of Intelligence and Information Systems / v.24 no.2 / pp.59-83 / 2018
  • With the increasing importance of sentiment analysis for grasping the needs of customers and the public, various types of deep learning models have been actively applied to English texts. In deep-learning sentiment analysis of English texts, the natural language sentences in training and test datasets are usually converted into sequences of word vectors before being entered into the deep learning models. Here, word vectors generally refer to vector representations of words obtained by splitting a sentence on space characters. There are several ways to derive word vectors, one of which is Word2Vec, used to produce the 300-dimensional Google word vectors from about 100 billion words of Google News data. These have been widely used in studies of sentiment analysis of reviews from various fields such as restaurants, movies, laptops, and cameras. Unlike in English, the morpheme plays an essential role in sentiment analysis and sentence structure analysis in Korean, a typical agglutinative language with developed postpositions and endings. A morpheme can be defined as the smallest meaningful unit of a language, and a word consists of one or more morphemes. For example, the word '예쁘고' consists of the morphemes '예쁘' (adjective stem) and '고' (connective ending). Reflecting the significance of Korean morphemes, it seems reasonable to adopt the morpheme as the basic unit in Korean sentiment analysis. Therefore, in this study, we use 'morpheme vectors' as the input to a deep learning model rather than the 'word vectors' mainly used for English text. A morpheme vector is a vector representation of a morpheme and can be derived by applying an existing word vector derivation mechanism to sentences divided into their constituent morphemes. Several questions arise here. What is the desirable range of POS (part-of-speech) tags when deriving morpheme vectors for improving the classification accuracy of a deep learning model? Is it proper to apply a typical word vector model, which relies primarily on the form of words, to Korean, which has a high ratio of homonyms? Will text preprocessing such as correcting spelling or spacing errors affect the classification accuracy, especially when drawing morpheme vectors from Korean product reviews with many grammatical mistakes and variations? We seek empirical answers to these fundamental issues, which are likely to be encountered first when applying deep learning models to Korean texts. As a starting point, we summarize these issues as three central research questions. First, which is more effective as the initial input of a deep learning model: morpheme vectors derived from grammatically correct texts of a domain other than the analysis target, or morpheme vectors derived from considerably ungrammatical texts of the same domain? Second, what is an appropriate morpheme vector derivation method for Korean with regard to the range of POS tags, homonyms, text preprocessing, and minimum frequency? Third, can we reach a satisfactory level of classification accuracy when applying deep learning to Korean sentiment analysis? To approach these research questions, we generate various types of morpheme vectors reflecting them and then compare classification accuracy using a non-static CNN (convolutional neural network) model that takes the morpheme vectors as input. As training and test datasets, 17,260 cosmetics product reviews from Naver Shopping are used. To derive the morpheme vectors, we use data from the same domain as the target and data from another domain: about 2 million Naver Shopping cosmetics product reviews and 520,000 Naver News articles, the latter arguably corresponding to Google's News data. The six primary sets of morpheme vectors constructed in this study differ in terms of the following three criteria. First, they come from two types of data source: Naver News, with high grammatical correctness, and Naver Shopping cosmetics product reviews, with low grammatical correctness. Second, they differ in the degree of data preprocessing: only splitting sentences, or additionally correcting spelling and spacing after sentence separation. Third, they vary in the form of input fed into the word vector model: either the morphemes themselves, or the morphemes with their POS tags attached. The morpheme vectors further vary depending on the range of POS tags considered, the minimum frequency of morphemes included, and the random initialization range. All morpheme vectors are derived with the CBOW (continuous bag-of-words) model, with a context window of 5 and a vector dimension of 300. The results suggest that using text from the same domain even with a lower degree of grammatical correctness, performing spelling and spacing corrections as well as sentence splitting, and incorporating morphemes of all POS tags including the incomprehensible category lead to better classification accuracy. The POS tag attachment, which is devised for the high proportion of homonyms in Korean, and the minimum frequency threshold for a morpheme to be included do not seem to have any definite influence on the classification accuracy.
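
A rough sketch of the morpheme-embedding step described above is given below, assuming the KoNLPy Okt tagger and gensim's Word2Vec; the sample sentence, the tokenizer, and the POS-tag-attachment format are illustrative assumptions, not the study's actual pipeline.

    # Sketch: derive Korean morpheme vectors with CBOW (window 5, 300 dimensions),
    # optionally attaching POS tags to morphemes, as one of the variants compared above.
    # The tagger, sample review, and tag format are assumptions for illustration.
    from konlpy.tag import Okt
    from gensim.models import Word2Vec

    okt = Okt()
    reviews = ["배송도 빠르고 제품도 예쁘고 좋아요"]      # hypothetical product review

    # Split each review into morphemes and attach POS tags ("예쁘고/Adjective"),
    # the variant meant to separate Korean homonyms.
    tokenized = [[f"{m}/{t}" for m, t in okt.pos(r)] for r in reviews]

    # CBOW (sg=0), context window 5, 300-dimensional vectors, as in the abstract.
    w2v = Word2Vec(tokenized, vector_size=300, window=5, sg=0, min_count=1)

    # These morpheme vectors would then initialize the embedding layer of a
    # non-static CNN classifier, i.e. they are fine-tuned during training.
    print(len(w2v.wv), w2v.wv.vectors.shape)    # vocabulary size, (vocab, 300)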

Wh-movement in the L2 Learner's Initial Syntax

  • Kim, Jung-Tae
    • English Language & Literature Teaching / v.10 no.2 / pp.1-23 / 2004
  • This article reports a bi-directional interlanguage study designed to investigate the initial state of L2 acquisition with regard to English and Korean wh-questions. Based on the UG system in line with minimalist theory, it was hypothesized that the L2 initial state is characterized by the most economical form of syntax, in which no overt wh-movement to Spec-CP is assumed. Results of the early interlanguage study showed that 1) L1 Korean learners of L2 English predominantly produced wh-questions with the fronted wh-word, but without productive wh-movement to the Spec-CP position; and 2) L1 English learners of L2 Korean overwhelmingly produced wh-questions with the wh-word remaining in situ. These results were interpreted as supporting the minimalist account of the L2 initial grammar, in that no overt syntactic wh-movement was adopted in the early interlanguages of either English or Korean, regardless of the learner's L1.

A Study for Success Factors in On-line Games

  • Jung, Jai-Jin
    • Journal of Korea Multimedia Society / v.9 no.12 / pp.1657-1668 / 2006
  • The last few years have represented a boom for the online gaming industry. Internet-based online games have become an increasingly popular form of entertainment; the gaming industry estimated that there would be over 26 million online gaming participants in 2002. The rapid development of online game content and related information technology will increase the size of the industry and have a profound impact on many aspects of our lives and our society. This paper develops an exploratory LISREL model for identifying the factors affecting players' loyalty to a specific brand of online game. The concepts of flow, word of mouth, feedback, challenge, social norms, and online community activities are introduced into the model as independent variables directly and indirectly affecting loyalty. Based on data collected from an online survey, the validity of the model has been tested, and conclusions have been drawn concerning the relationships between loyalty and flow, word of mouth, and the other independent variables. It is hoped that this result may provide useful guidelines for developing successful online game content.
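
A structural model of this kind could be specified roughly as sketched below, assuming the semopy package; the construct names and the paths (e.g. flow mediating challenge and feedback) are illustrative guesses at the direct and indirect effects mentioned above, not the paper's actual LISREL specification.

    # Hypothetical loyalty model in semopy (a Python SEM package); variable names,
    # paths, and the survey file are placeholders, not the paper's model or data.
    import pandas as pd
    from semopy import Model

    spec = """
    flow    ~ challenge + feedback
    loyalty ~ flow + word_of_mouth + social_norms + community_activity
    """

    data = pd.read_csv("survey.csv")   # placeholder: one column per construct score

    model = Model(spec)
    model.fit(data)                    # maximum-likelihood estimation of the paths
    print(model.inspect())             # coefficient estimates, standard errors, p-values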

THE FRACTIONAL TOTIENT FUNCTION AND STURMIAN DIRICHLET SERIES

  • Kwon, DoYong
    • Honam Mathematical Journal / v.39 no.2 / pp.297-305 / 2017
  • Let $\alpha > 0$ be a real number and $(s_{\alpha}(n))_{n\geq 1}$ be the lexicographically greatest Sturmian word of slope $\alpha$. We investigate Dirichlet series of the form $\sum_{n=1}^{\infty} s_{\alpha}(n)\,n^{-s}$. To do this, a generalization of Euler's totient function is required. For a real $\alpha > 0$ and a positive integer $n$, an arithmetic function $\varphi_{\alpha}(n)$ is defined to be the number of positive integers $m$ for which $\gcd(m, n) = 1$ and $0 < m/n < \alpha$. Under the condition $\operatorname{Re}(s) > 1$, this paper establishes the identity $\sum_{n=1}^{\infty} s_{\alpha}(n)\,n^{-s} = 1 + \sum_{n=1}^{\infty} \varphi_{\alpha}(n)\bigl(\zeta(s) - \zeta(s, 1+n^{-1})\bigr)\,n^{-s}$.
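
The fractional totient function defined above can be computed directly from its definition; the short sketch below follows that definition only and does not reproduce the paper's Sturmian-word or Dirichlet-series machinery.

    # phi_alpha(n): the number of positive integers m with gcd(m, n) = 1 and 0 < m/n < alpha,
    # computed directly from the definition in the abstract (illustrative sketch only).
    from fractions import Fraction
    from math import gcd

    def fractional_totient(alpha: Fraction, n: int) -> int:
        bound = alpha * n                       # m must satisfy m < alpha * n
        return sum(1 for m in range(1, int(bound) + 1)
                   if m < bound and gcd(m, n) == 1)

    # For alpha = 1 and n >= 2 this reduces to Euler's totient phi(n).
    print([fractional_totient(Fraction(1), n) for n in range(2, 10)])   # [1, 2, 2, 4, 2, 6, 4, 6]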

An Analysis of English Reduplicative compounds (영어 중첩복합어 분석)

  • 김형엽
    • Lingua Humanitatis / v.2 no.1 / pp.303-314 / 2002
  • The main purpose of this paper is to show how Jespersen analyzed the data on English compounds related to reduplication. In dealing with compound words, he classified the examples involving reduplication as a separate group and attempted to account for the patterns based on the structure of the first syllable constituting the initial part of the second element of a compound word. I have tried to explain the peculiar shape of the reduplicative pattern in English within Optimality Theory, in particular using the method of 'melodic overwriting' of McCarthy (1997). According to this analysis, the initial part of the second element of a compound has to be stipulated before reduplication occurs. When the reduplicant is determined at the first syllable of the second element, the form stipulated for that position appears there instead of a repetition of the morphemic shape of the first syllable of the first element of the word.

A study on the speech recognition by HMM based on multi-observation sequence (다중 관측열을 토대로한 HMM에 의한 음성 인식에 관한 연구)

  • 정의봉
    • Journal of the Korean Institute of Telematics and Electronics S / v.34S no.4 / pp.57-65 / 1997
  • The purpose of this paper is to propose an HMM (hidden Markov model) based on multi-observation sequences for isolated word recognition. The proposed model generates the MSVQ codebook by dividing each word, and likewise the training data, into several sections. The multi-observation sequence for each section is then obtained by weighting the distance vectors from lower values to higher ones, and the sequence with a high probability value is used during recognition. 146 DDD area names are selected as the recognition vocabulary, and 10 LPC cepstrum coefficients are used as the feature parameters. In addition to the speech recognition experiments with the proposed model, comparative experiments using DP, MSVQ, and a general HMM are carried out on the same data under the same conditions. The experimental results show that the HMM based on multi-observation sequences proposed in this paper is superior to the methods using DP, MSVQ, and general HMM models in both recognition rate and recognition time.
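
For context, isolated-word recognition with discrete HMMs ultimately reduces to scoring a vector-quantized observation sequence against each word model and choosing the most likely word; the sketch below shows that decision rule with the scaled forward algorithm. It illustrates the general HMM framework only, and the toy models and symbols are assumptions, not the paper's MSVQ or multi-observation construction.

    # Score a quantized observation sequence against each word HMM with the
    # (scaled) forward algorithm and pick the word with the highest likelihood.
    # The two toy word models below are illustrative, not trained models.
    import numpy as np

    def forward_log_likelihood(pi, A, B, obs):
        # pi: initial state probs, A: state transitions, B: emission probs over
        # VQ codebook symbols, obs: list of observed symbol indices.
        alpha = pi * B[:, obs[0]]
        log_p = np.log(alpha.sum())
        alpha /= alpha.sum()
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]
            log_p += np.log(alpha.sum())
            alpha /= alpha.sum()
        return log_p

    words = {
        "seoul": (np.array([1.0, 0.0]),
                  np.array([[0.7, 0.3], [0.0, 1.0]]),
                  np.array([[0.6, 0.2, 0.1, 0.1], [0.1, 0.1, 0.2, 0.6]])),
        "busan": (np.array([1.0, 0.0]),
                  np.array([[0.5, 0.5], [0.0, 1.0]]),
                  np.array([[0.1, 0.6, 0.2, 0.1], [0.2, 0.1, 0.6, 0.1]])),
    }
    obs = [0, 0, 3, 3]                                   # a quantized test utterance
    best = max(words, key=lambda w: forward_log_likelihood(*words[w], obs))
    print(best)                                          # word model with the highest likelihood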

An improved spectrum mapping applied to speaker adaptive Korean word recognition

  • Matsumoto, Hiroshi;Lee, Yong-Ju;Kim, Hoi-Rim;Kido, Ken'iti
    • Proceedings of the Acoustical Society of Korea Conference / 1994.06a / pp.1009-1014 / 1994
  • This paper improves the previously proposed spectral mapping method for supervised speaker adaptation, in which a mapped spectrum is interpolated from speaker difference vectors at typical spectra based on a minimized distortion criterion. In estimating these difference vectors, it is important to find an appropriate number of typical points. The previous method empirically adjusts the number of typical points, while the present method optimizes the effective number by rank reduction of the normal equations. This algorithm was applied to supervised speaker adaptation for Korean word recognition using the templates from a prototype male speaker. The results showed that the rank reduction technique not only determines an optimal number of code vectors automatically, but also slightly improves the recognition scores compared with those obtained by the previous method.
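
The idea of rank reduction of the normal equations can be illustrated with a truncated-SVD least-squares solve, as sketched below; this is a generic numerical illustration of limiting the effective number of parameters, not the paper's spectral-mapping formulation.

    # Solve a least-squares problem A x ~ b keeping only the top-r singular
    # directions, one standard way to reduce the effective rank of the normal
    # equations. The random design matrix below is purely illustrative.
    import numpy as np

    def rank_reduced_solve(A, b, r):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        U, s, Vt = U[:, :r], s[:r], Vt[:r, :]       # drop the smallest singular values
        return Vt.T @ ((U.T @ b) / s)

    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 10))
    A[:, 9] = A[:, 0] + 1e-8 * rng.normal(size=50)  # nearly rank-deficient design
    b = A @ rng.normal(size=10) + 0.01 * rng.normal(size=50)

    x_full = np.linalg.lstsq(A, b, rcond=None)[0]   # uses all 10 directions
    x_red  = rank_reduced_solve(A, b, r=9)          # rank-reduced, better conditioned
    print(np.linalg.norm(x_full), np.linalg.norm(x_red))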

Influence Maximization Scheme against Various Social Adversaries

  • Noh, Giseop;Oh, Hayoung;Lee, Jaehoon
    • Journal of information and communication convergence engineering / v.16 no.4 / pp.213-220 / 2018
  • With the exponential development of social networks, their fundamental role as a medium to spread information, ideas, and influence has gained importance. This influence can be expressed through the relationships and interactions within a group of individuals. Accordingly, models and studies from various domains have addressed the influence maximization problem for the "word of mouth" effects of new products. In reality, two or more related social groups, such as commercial companies and service providers, may exist within the same market. Under such a scenario, these so-called social adversaries competitively try to expand their market influence against each other. To address the influence maximization (IM) problem between them, we propose a novel IM problem for social adversarial players (IM-SA) that exploits social network attributes to infer the unknown adversary's network configuration. We define a mathematical closed form to demonstrate that the proposed scheme can achieve a near-optimal solution for a player.
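
For orientation, the single-player baseline that adversarial formulations build on is greedy influence maximization under a diffusion model such as the independent cascade; a minimal sketch is given below. The toy network, propagation probability, and Monte-Carlo settings are assumptions, and the paper's IM-SA formulation and closed-form analysis are not reproduced.

    # Greedy influence maximization under the independent cascade model:
    # repeatedly add the node with the largest estimated marginal spread.
    # Graph, probability p, and run counts are illustrative placeholders.
    import random

    def simulate_ic(graph, seeds, p=0.1):
        # One cascade: each newly activated node activates each inactive
        # neighbor independently with probability p; return the spread size.
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in graph.get(u, []):
                    if v not in active and random.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        return len(active)

    def greedy_im(graph, k, runs=200):
        # Pick k seeds, each time adding the node with the largest estimated marginal gain.
        seeds = []
        for _ in range(k):
            best = max((n for n in graph if n not in seeds),
                       key=lambda n: sum(simulate_ic(graph, seeds + [n]) for _ in range(runs)))
            seeds.append(best)
        return seeds

    # Hypothetical toy network as an adjacency list.
    toy = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
    print(greedy_im(toy, k=2))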

Historic Status and Grammatical Characteristics of Korean language in the Early 20th Century (한국어사에서 20세기 초 한국어의 위상과 문법 특징)

  • Hong, Jongseon
    • Korean Linguistics / v.71 / pp.1-22 / 2016
  • The early 20th century is a period when Korea confronted the surging waves of modernization and responded internally in a variety of ways. The Korean language, not immune to the upheaval, also experienced new changes and gradually gained the characteristics of today's Korean. Although scholars have not yet fully agreed upon the periodization of Korean, the Gabo reformation (1896) is usually considered the beginning of modern Korean; thus, the early 20th century was also the beginning of modern Korean. Phonological, lexical, and grammatical characteristics of present-day Korean began to appear during this period. Phonologically, the 10-vowel system was established, glottal sounds and aspirated sounds increased, and vowel harmony declined. Phenomena such as vowel raising, front-vowelization, monophthongization, and the word-initial rule appeared. Meanwhile, hangul-Chinese mixed writing became common practice, hangul-only writing also started to appear in narrative writing, and elements of spoken language began to be reflected in written language, all of which pointed to the unification of written and spoken language. Under the influence of modernization, a great number of new words appeared; in particular, Japanese and other foreign words flooded into the language in great quantities. Grammatically, the '-eos-(-엇-), -neun-(-는-), -ges-(-겟-)' trichotomy of tenses was established, and the hearer-oriented honorific system formed a binary system of 'hasoseo(하소서), hasibsio(하십시오), hao(하오), hage(하게), haera(해라)' and 'hae(해), haeyo(해요)'. In word formation and sentence construction, the use of '-gi(-기)' became more frequent than '-eum(-음)', while the use of '~geot(~것)' also significantly increased. In negative, causative, and passive expressions, the long form, which has fewer restrictions than the short form, became more frequent. A tendency towards simplicity appeared: long, complex sentences with several clauses tended to be avoided, and short, simple sentences became preferred. Korean linguistics scholars should pay closer attention to the modernization period, which includes the early 20th century. To fully understand today's Korean language, more thorough research on this immediately preceding period is necessary.

Rule Based Document Conversion and Information Extraction on the Word Document (워드문서 콘텐츠의 사용자 XML 콘텐츠로의 변환 및 저장 시스템 개발)

  • Joo, Won-Kyun;Yang, Myung-Seok;Kim, Tae-Hyun;Lee, Min-Ho;Choi, Ki-Seok
    • Proceedings of the Korea Contents Association Conference / 2006.11a / pp.555-559 / 2006
  • This paper aims to contribute to extracting and storing various forms of information of user interest by using user-defined structural rules and XML-based word-document conversion techniques. The system, named PPE, consists of three essential elements. One is the conversion element, which converts word documents such as HWP and DOC into XML documents; another is the extraction element, which prepares structural rules and extracts the relevant information from the XML document according to those rules; and the other is the storage element, which produces the final XML document or stores it in a database system. For word-document conversion, we developed an OCX-based converting daemon. To help users extract information, we developed a script language with a native function/variable processing engine extended from XSLT. This system can be used to construct word-document content databases or to provide various information services based on raw word documents. We applied it to a project management system and a project result management system.
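
A rule-based extraction step of the kind described above can be sketched with Python's standard XML tooling; the element names, the rule format, and the sample document below are assumptions for illustration, and the PPE system's OCX converter and XSLT-extended script language are not reproduced.

    # Apply user-defined structural rules (here, XPath-like paths) to an
    # XML-converted word document and collect the matched content.
    # Element names, rules, and the sample document are illustrative.
    import xml.etree.ElementTree as ET

    xml_doc = """
    <document>
      <section title="Project Overview">
        <para>Project name: Word-to-XML converter</para>
      </section>
      <table name="budget"><row><cell>2006</cell><cell>5,000</cell></row></table>
    </document>
    """

    rules = {
        "overview":    ".//section[@title='Project Overview']/para",
        "budget_rows": ".//table[@name='budget']/row",
    }

    root = ET.fromstring(xml_doc)
    extracted = {
        "overview":    root.find(rules["overview"]).text,
        "budget_rows": [[c.text for c in row] for row in root.findall(rules["budget_rows"])],
    }
    print(extracted)   # the result could then be serialized to XML or stored in a database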
