• Title/Summary/Keyword: Homonym

Sentiment Analysis of Korean Reviews Using CNN: Focusing on Morpheme Embedding (CNN을 적용한 한국어 상품평 감성분석: 형태소 임베딩을 중심으로)

  • Park, Hyun-jung;Song, Min-chae;Shin, Kyung-shik
    • Journal of Intelligence and Information Systems / v.24 no.2 / pp.59-83 / 2018
  • With the increasing importance of sentiment analysis for grasping the needs of customers and the public, various types of deep learning models have been actively applied to English texts. In deep learning-based sentiment analysis of English texts, the natural language sentences in the training and test datasets are usually converted into sequences of word vectors before being fed into the model. Here, word vectors generally refer to vector representations of the words obtained by splitting a sentence on space characters. There are several ways to derive word vectors, one of which is Word2Vec, used to produce the 300-dimensional Google word vectors from about 100 billion words of Google News data. These vectors have been widely used in sentiment analysis studies of reviews from various fields such as restaurants, movies, laptops, and cameras. Unlike in English, the morpheme plays an essential role in sentiment analysis and sentence structure analysis in Korean, a typical agglutinative language with well-developed postpositions and endings. A morpheme can be defined as the smallest meaningful unit of a language, and a word consists of one or more morphemes. For example, the word '예쁘고' consists of the morphemes '예쁘 (= adjective)' and '고 (= connective ending)'. Given the significance of Korean morphemes, it seems reasonable to adopt the morpheme as the basic unit in Korean sentiment analysis. Therefore, in this study, we use the 'morpheme vector' as the input to a deep learning model rather than the 'word vector' mainly used for English text. A morpheme vector is a vector representation of a morpheme and can be derived by applying an existing word vector derivation mechanism to sentences divided into their constituent morphemes. This raises several questions. What is the desirable range of POS (Part-Of-Speech) tags when deriving morpheme vectors to improve the classification accuracy of a deep learning model? 
Is it appropriate to apply a typical word vector model, which relies primarily on the form of words, to Korean, which has a high homonym ratio? Will text preprocessing such as correcting spelling or spacing errors affect the classification accuracy, especially when deriving morpheme vectors from Korean product reviews with many grammatical mistakes and variations? We seek empirical answers to these fundamental issues, which are likely to be encountered first when applying deep learning models to Korean texts. As a starting point, we summarize these issues as three central research questions. First, which is more effective as the initial input of a deep learning model: morpheme vectors derived from grammatically correct texts of a domain other than the analysis target, or morpheme vectors derived from considerably ungrammatical texts of the same domain? Second, what is an appropriate morpheme vector derivation method for Korean with regard to the range of POS tags, homonyms, text preprocessing, and minimum frequency? Third, can a satisfactory level of classification accuracy be achieved when applying deep learning to Korean sentiment analysis? To address these questions, we generate various types of morpheme vectors reflecting the research questions and then compare the classification accuracy of a non-static CNN (Convolutional Neural Network) model that takes the morpheme vectors as input. For the training and test datasets, 17,260 cosmetics product reviews from Naver Shopping are used. To derive morpheme vectors, we use data from the same domain as the target and data from another domain: about 2 million cosmetics product reviews from Naver Shopping and 520,000 items of Naver News data, arguably corresponding to Google's News data. The six primary sets of morpheme vectors constructed in this study differ in terms of the following three criteria. 
First, they come from two types of data source: Naver News, with high grammatical correctness, and Naver Shopping's cosmetics product reviews, with low grammatical correctness. Second, they differ in the degree of data preprocessing, namely, sentence splitting only, or sentence splitting followed by additional spelling and spacing corrections. Third, they vary in the form of input fed into the word vector model: the morphemes are entered either by themselves or with their POS tags attached. The morpheme vectors further vary depending on the range of POS tags considered, the minimum frequency of the morphemes included, and the random initialization range. All morpheme vectors are derived through the CBOW (Continuous Bag-Of-Words) model with a context window of 5 and a vector dimension of 300. The results suggest that better classification accuracy is obtained by using same-domain text even with a lower degree of grammatical correctness, by performing spelling and spacing corrections in addition to sentence splitting, and by incorporating morphemes of all POS tags, including the incomprehensible category. The POS tag attachment, which is devised to handle the high proportion of homonyms in Korean, and the minimum frequency threshold for morphemes to be included do not seem to have any definite influence on the classification accuracy.
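
The input preparation described in this abstract (morphemes with or without POS tags attached, a minimum-frequency filter, and CBOW with a context window of 5) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the sample sentence, the tag names (MM, NNG, JKS, VA, EF), and the helper names `to_tokens`, `drop_rare`, and `cbow_pairs` are all assumptions, and the morphological analysis itself is assumed to have been done beforehand by some analyzer.

```python
from collections import Counter

# Hypothetical analyzer output for "이 제품이 좋아요" ("this product is good").
# Each item is a (morpheme, POS tag) pair; the tag names are illustrative only.
sentence = [("이", "MM"), ("제품", "NNG"), ("이", "JKS"), ("좋", "VA"), ("아요", "EF")]


def to_tokens(pairs, attach_pos=False, allowed_pos=None):
    """Turn (morpheme, tag) pairs into tokens, optionally keeping only some tags."""
    tokens = []
    for morpheme, tag in pairs:
        if allowed_pos is not None and tag not in allowed_pos:
            continue  # restrict the range of POS tags considered
        tokens.append(f"{morpheme}/{tag}" if attach_pos else morpheme)
    return tokens


def drop_rare(token_lists, min_count=1):
    """Remove tokens whose corpus frequency is below min_count."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return [[t for t in tokens if counts[t] >= min_count] for tokens in token_lists]


def cbow_pairs(tokens, window=5):
    """Build (context, target) pairs as CBOW would see them for one sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


plain = to_tokens(sentence)                    # the two '이' morphemes collapse into one token type
tagged = to_tokens(sentence, attach_pos=True)  # '이/MM' and '이/JKS' stay distinct
```

With POS tags attached, the determiner '이' and the particle '이' become different vocabulary entries, which is the homonym-handling idea the abstract describes; the actual vectors would then be trained on such token lists by a Word2Vec implementation.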

A Study on User's Requirement Analysis for Improvement of OASIS (한의학술논문검색시스템 기능개선을 위한 사용자 요구 분석에 관한 연구)

  • Han, Jeong-Min;Bae, Sun-Hee;Song, Mi-Young
    • Journal of Information Management / v.40 no.3 / pp.79-97 / 2009
  • Thanks to the recent development of search engines and web technologies, semantic search technology has emerged, which assigns relevant meaning to keywords beyond conventional keyword search services. In step with these advances, OASIS, offered by KIOM, also needs to be enhanced. To this end, KIOM surveyed members who had used OASIS on their demographic and sociological characteristics (position, status, and career), the convenience of OASIS, and the value of the papers offered in OASIS. The importance of each area of oriental medicine was also examined to set a new direction for improving OASIS. The user survey showed that, beyond a simple keyword search system, both an automatic search system that can find the meaning of Chinese character-centered keywords and an authority system that can distinguish homonyms should be introduced quickly. We also concluded that it is necessary to link citation index information on references with the laboratory information of the agencies concerned, and to interconnect with major web sites around the world using Open API. OASIS is the only domestic web site offering papers that cover oriental medicine. Therefore, if the requirements of oriental medical circles for the site are sufficiently analyzed and the problems of its information search system are resolved, OASIS is expected to play a critical role in the development of oriental medicine.

A cytotaxonomic study of Allium (Alliaceae) sect. Sacculiferum in Korea (한국산 부추속 산부추절의 세포분류학적 연구)

  • Ko, Eun-Mi;Choi, Hyeok-Jae;Oh, Byoung-Un
    • Korean Journal of Plant Taxonomy / v.39 no.3 / pp.170-180 / 2009
  • Somatic chromosome counts and karyotype analyses were carried out for eight taxa of Korean Allium sect. Sacculiferum. The basic chromosome number of sect. Sacculiferum was x = 8, and the taxa could be cytologically divided into two groups: a diploid group (2n = 2x = 16) containing A. thunbergii var. thunbergii, A. thunbergii var. deltoides, A. thunbergii var. teretifistulosum, A. deltoidefistulosum, A. longistylum, A. linearifolium and A. taqueti, and a tetraploid group (2n = 4x = 32) containing only A. sacculiferum. All observed chromosomes were classified as metacentric, submetacentric or subtelocentric. Metacentric chromosomes appeared in all taxa examined. One or two pairs of submetacentric chromosomes were observed in most taxa except A. sacculiferum, the only taxon with subtelocentric chromosomes. All taxa had a pair of homologous chromosomes with satellites, and the B-chromosomes found in A. thunbergii var. thunbergii, A. deltoidefistulosum, A. sacculiferum and A. longistylum were metacentric or telocentric. The karyotypes of A. longistylum and A. linearifolium were investigated for the first time in this study. In conclusion, the somatic chromosome numbers and karyotypes of members of sect. Sacculiferum were valuable characters for identifying and delimiting taxa and for investigating interspecific relationships. In addition, A. thunbergii var. teretifolium, an invalid name (homonym), was renamed A. thunbergii var. teretifistulosum H. J. Choi & B. U. Oh.

Dynamic Virtual Ontology using Tags with Semantic Relationship on Social-web to Support Effective Search (효율적 자원 탐색을 위한 소셜 웹 태그들을 이용한 동적 가상 온톨로지 생성 연구)

  • Lee, Hyun Jung;Sohn, Mye
    • Journal of Intelligence and Information Systems / v.19 no.1 / pp.19-33 / 2013
  • In this research, the proposed Dynamic Virtual Ontology using Tags (DyVOT) supports dynamic search of resources according to user requirements, using tags from social web resources. Tags are generally defined as annotations, in the form of a series of descriptive words, given by social users who tag social information resources such as web pages, images, YouTube, videos, etc. Tags therefore characterize and mirror information resources, and as metadata they can be matched to resources. Consequently, semantic relationships between tags can be extracted from the dependencies between tags as representatives of resources. However, this approach is limited because tags, which are usually marked by a series of words, include allophonic synonyms and homonyms. Thus, research on folksonomies using tags has been applied to the classification of words by semantic-based allophonic synonyms. In addition, some research focuses on clustering and/or classification of resources by semantic-based relationships among tags. Nevertheless, this research is also limited, because it focuses on semantic-based hyper/hypo relationships or clustering among tags without considering conceptual associative relationships between the classified or clustered groups. This makes it difficult to search resources effectively according to user requirements. In this research, the proposed DyVOT uses tags and constructs an ontology for effective search. We assume that tags are extracted from user requirements and are used to construct multiple sub-ontologies as combinations of some or all of the tags. In addition, the proposed DyVOT constructs an ontology based on hierarchical and associative relationships among tags for effective search of a solution. The ontology is composed of a static ontology and a dynamic ontology. 
The static ontology defines semantic-based hierarchical hyper/hypo relationships among tags with a tree structure, as in (http://semanticcloud.sandra-siegel.de/). From the static ontology, DyVOT extracts multiple sub-ontologies using multiple sub-tags, which are constructed from parts of the tags. Each sub-ontology is then constructed from the hierarchy paths that contain the sub-tag. To create the dynamic ontology, it is necessary to define associative relationships among the multiple sub-ontologies extracted from the hierarchical relationships of the static ontology. An associative relationship is defined by the resources shared between tags that are linked by the multiple sub-ontologies. The association is measured by the degree of shared resources allocated to the tags of the sub-ontologies. If the association value is larger than a threshold value, an associative relationship among the tags is newly created. The associative relationships are used to merge the multiple sub-ontologies and construct a new hierarchy. To construct the dynamic ontology, it is essential to define a new class that links two or more sub-ontologies; the class is generated from merged tags that are shown to be highly associative through their shared resources. The class is then applied to generate a new hierarchy with the extracted multiple sub-ontologies, creating the dynamic ontology. The newly created class is placed in, and belongs to, the dynamic ontology, and it is used to form new hyper/hypo hierarchical relationships between the class and the tags linked to the multiple sub-ontologies. Finally, DyVOT is built from the newly defined associative relationships, which are extracted from the hierarchical relationships among tags. Resources are matched to DyVOT, which narrows the search boundary and shortens the search paths. 
While a static data catalog (Dean and Ghemawat, 2004; 2008) searches resources statically according to user requirements, the proposed DyVOT searches resources dynamically using multiple sub-ontologies with parallel processing. In this light, DyVOT improves the correctness and agility of search and reduces search effort by shortening the search path.
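
The thresholded association step described in this abstract can be illustrated with a small sketch. The abstract does not state how the "degree of shared resources" is computed, so the Jaccard overlap used here is an assumption, as are the sample tags, the resource identifiers, and the function names `association` and `associative_links`.

```python
from itertools import combinations

# Hypothetical assignment of resources to tags (identifiers are made up).
tag_resources = {
    "jaguar": {"r1", "r2", "r3"},
    "car": {"r2", "r3", "r4"},
    "animal": {"r1", "r5"},
}


def association(resources_a, resources_b):
    """Degree of shared resources, assumed here to be the Jaccard overlap."""
    union = resources_a | resources_b
    return len(resources_a & resources_b) / len(union) if union else 0.0


def associative_links(tag_to_resources, threshold):
    """Return tag pairs whose association exceeds the threshold."""
    links = []
    for tag_a, tag_b in combinations(sorted(tag_to_resources), 2):
        score = association(tag_to_resources[tag_a], tag_to_resources[tag_b])
        if score > threshold:  # a new associative relationship is created
            links.append((tag_a, tag_b, score))
    return links


links = associative_links(tag_resources, threshold=0.3)
```

In this toy data only "car" and "jaguar" share enough resources to be linked; in the approach the abstract describes, such links would then drive the merging of sub-ontologies into a new hierarchy, which this sketch does not attempt.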