Search | Korea Science

A Case Study on Universal Dependency Tagsets (다국어 범용 의존관계 주석체계(Universal Dependencies) 적용 연구 - 한국어와 일본어의 비교를 중심으로)

Han, Jiyoon;Lee, Jin;Lee, Chanyoung;Kim, Hansaem
- Cross-Cultural Studies / v.53 / pp.163-192 / 2018
This paper examines how Universal Dependencies (UD) has been applied to Korean and Japanese, two languages with similar morphological characteristics, and uses the comparison to explore how UD can be applied to Korean and improved. Because both languages are agglutinative and morphologically rich, applying UD, which was built around inflectional languages such as English, raises many difficulties. We examine the application of UPOS and DEPREL, the two components of UD. For UPOS, we look at categorization problems involving predicates, such as AUX, ADJ, and VERB, and at how annotation units are handled. For DEPREL, we discuss how syntactic problems follow from the choice of basic unit for syntactic annotation. We investigate the problems with case and aux that arise from head placement in Korean and Japanese as head-final languages, as well as the annotation of parallel structures and the setting of the basic annotation unit. Among the relation tags, case and aux are discussed because their distributions differ most noticeably when the Japanese annotation patterns are compared with Korean. The case relation concerns postpositional particles in both Korean and Japanese. The aux relation corresponds to auxiliary predicates in Korean and to auxiliary verbs in Japanese. Examining specific annotation patterns, we found that Japanese aux is assigned not only to auxiliary predicates but also to auxiliary elements that add grammatical meaning to the verb, including forms corresponding to Korean verbal endings. Japanese UD takes the basic unit of morphological analysis as the basic unit of syntactic annotation. Korean UD annotation therefore needs to consider how the morphological analysis unit can be used as information for dependency annotation.

An Investigation on the Periodical Transition of News related to North Korea using Text Mining (텍스트마이닝을 활용한 북한 관련 뉴스의 기간별 변화과정 고찰)

Park, Chul-Soo
- Journal of Intelligence and Information Systems / v.25 no.3 / pp.63-88 / 2019
The goal of this paper is to investigate changes in North Korea's domestic and foreign policies through automated text analysis of how North Korea is represented in South Korean mass media. Based on that data, we then analyze the status of text mining research, using a text mining technique to find the topics, methods, and trends of text mining research. We also investigate the characteristics and method of analysis of the text mining techniques, confirmed by analysis of the data. In this study, the R program was used to apply the text mining technique. R is free software for statistical computing and graphics. Text mining methods make it possible to highlight the most frequently used keywords in a body of text, for example by creating a word cloud, also referred to as a text cloud or tag cloud. This study proposes a procedure to find meaningful tendencies based on a combination of word clouds and co-occurrence networks. It aims to explore more objectively the images of North Korea represented in South Korean newspapers by quantitatively reviewing the patterns of language use related to North Korea in newspaper big data from November 1, 2016 to May 23, 2019. We divided the data into three periods reflecting recent inter-Korean relations. The period before January 1, 2018 was set as the Before Phase of Peace Building. The period from January 1, 2018 to February 24, 2019 was set as the Peace Building Phase, during which the New Year's message of Kim Jong-un and the PyeongChang Olympics formed an atmosphere of peace on the Korean peninsula. The third period, after the Hanoi Peace summit, was marked by silence in the relationship between North Korea and the United States and was therefore called the Depression Phase of Peace Building. This study analyzes news articles related to North Korea in the Korea Press Foundation database (www.bigkinds.or.kr) through text mining, to investigate characteristics of the Kim Jong-un regime's South Korea policy and unification discourse.
The main results of this study show that trends in the North Korean national policy agenda can be discovered based on clustering and visualization algorithms. In particular, the study examines changes in international circumstances, domestic conflicts, the living conditions of North Korea, the South's aid project for the North, the conflicts of the two Koreas, the North Korean nuclear issue, and the North Korean refugee problem through co-occurrence word analysis. It also offers an analysis of South Korean mentality toward North Korea in terms of semantic prosody. In the Before Phase of Peace Building, the analysis yielded the order 'Missiles', 'North Korea Nuclear', 'Diplomacy', 'Unification', and 'South-North Korean'. The Peace Building Phase yielded the order 'Panmunjom', 'Unification', 'North Korea Nuclear', 'Diplomacy', and 'Military'. The Depression Phase of Peace Building yielded the order 'North Korea Nuclear', 'North and South Korea', 'Missile', 'State Department', and 'International'. There are 16 words adopted in all three periods. The order is as follows: 'missile', 'North Korea Nuclear', 'Diplomacy', 'Unification', 'North and South Korea', 'Military', 'Kaesong Industrial Complex', 'Defense', 'Sanctions', 'Denuclearization', 'Peace', 'Exchange and Cooperation', and 'South Korea'. We expect that the results of this study will contribute to analyzing the trends of news content on North Korea associated with North Korea's provocations, and that future research on North Korean trends will build on these results. We will continue to study the development of a model for North Korea risk measurement that can anticipate and respond to North Korea's behavior in advance. We expect that the text mining analysis method and the scientific data analysis technique will be applied to the field of North Korea and unification research.
Through these academic studies, I hope to see a lot of studies that make important contributions to the nation.
https://doi.org/10.13088/jiis.2019.25.3.063
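The abstract above combines keyword frequencies (the basis of a word cloud) with co-occurrence networks. The study reports using R; the following is only a minimal Python sketch of the two counts, and the function name `keyword_stats` and the tokenized articles are hypothetical illustrations, not data or code from the paper.

```python
from collections import Counter
from itertools import combinations


def keyword_stats(documents):
    """Return (term frequencies, co-occurrence counts) for tokenized documents."""
    term_freq = Counter()
    cooccur = Counter()
    for tokens in documents:
        # Term frequency drives the size of each word in a word cloud.
        term_freq.update(tokens)
        # Count each unordered keyword pair once per document; these counts
        # become the edge weights of a co-occurrence network.
        for pair in combinations(sorted(set(tokens)), 2):
            cooccur[pair] += 1
    return term_freq, cooccur


# Hypothetical, already-tokenized articles (not data from the study).
docs = [
    ["missile", "sanctions", "diplomacy"],
    ["missile", "denuclearization", "diplomacy"],
    ["peace", "unification", "diplomacy"],
]
term_freq, cooccur = keyword_stats(docs)
print(term_freq.most_common(2))
print(cooccur[("diplomacy", "missile")])
```

Pairs are stored in sorted order so that each undirected edge has a single key.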

A Study on Flammability Risk of Flammable Liquid Mixture (가연성 액체 혼합물의 인화 위험성에 관한 연구)

Kim, Ju Suk;Koh, Jae Sun
- Journal of the Society of Disaster Information / v.16 no.4 / pp.701-711 / 2020
Purpose: This study experimentally examined whether the flammability risk increases or decreases in a mixture of two substances (combustible + combustible), in order to present the risk of such mixtures. Method: The flash point test and the processing of results followed KS M 2010-2008, the Tag closed-cup test method used to determine the flash point of crude oil and petroleum products. The flash point was measured with test equipment manufactured by TANAKA of Japan that satisfies the test standards of KS M 2010; LP gas was used as the ignition source and water as the coolant. During flash point measurement, cooling water at about 2℃ was used. Results: For flammable + combustible mixtures, there was little change in flash point if the flash point difference between the two substances was not large, and if the flash point difference between the two substances was low, the flash point tended to increase as the number of substances with a high flash point increased. However, in the case of toluene and methanol, the flash point of the mixture was lower than that of the material with the lower flash point. In the case of paint thinner, the flash point was not easy to predict because the thinner is itself a mixture, but experimental measurement gave values between -24℃ and 7℃.
Conclusion: This study assessed the risk of flammable mixtures experimentally, with the aim of securing the effectiveness of the detailed criteria for determining dangerous goods under the existing dangerous goods safety management law and the reliability and reproducibility of such determinations. It presents criteria and can provide reference data on experimental criteria for the flammable liquids that are regulated at firefighting sites. In addition, as know-how on differences between test methods accumulates, the results are expected to serve as a basis for research on the risk assessment and determination of dangerous goods.
https://doi.org/10.15683/kosdi.2020.12.31.701

A Folksonomy Ranking Framework: A Semantic Graph-based Approach (폭소노미 사이트를 위한 랭킹 프레임워크 설계: 시맨틱 그래프기반 접근)

Park, Hyun-Jung;Rho, Sang-Kyu
- Asia pacific journal of information systems / v.21 no.2 / pp.89-116 / 2011
In collaborative tagging systems such as Delicious.com and Flickr.com, users assign keywords or tags to their uploaded resources, such as bookmarks and pictures, for their future use or sharing purposes. The collection of resources and tags generated by a user is called a personomy, and the collection of all personomies constitutes the folksonomy. The most significant need of folksonomy users is to efficiently find useful resources or experts on specific topics. An excellent ranking algorithm would assign higher ranking to more useful resources or experts. What resources are considered useful in a folksonomic system? Does a standard superior to frequency or freshness exist? The resource recommended by more users with more expertise should be worthy of attention. This ranking paradigm can be implemented through a graph-based ranking algorithm. Two well-known representatives of such a paradigm are PageRank by Google and HITS (Hypertext Induced Topic Selection) by Kleinberg. Both PageRank and HITS assign a higher evaluation score to pages linked to more higher-scored pages. HITS differs from PageRank in that it utilizes two kinds of scores: authority and hub scores. The ranking objects of these algorithms are limited to Web pages, whereas the ranking objects of a folksonomic system are somewhat heterogeneous (i.e., users, resources, and tags). Therefore, uniform application of the voting notion of PageRank and HITS based on the links to a folksonomy would be unreasonable. In a folksonomic system, each link corresponding to a property can have an opposite direction, depending on whether the property is in the active or the passive voice. The current research stems from the idea that a graph-based ranking algorithm could be applied to the folksonomic system using the concept of mutual interactions between entities, rather than the voting notion of PageRank or HITS.
The concept of mutual interactions, proposed for ranking the Semantic Web resources, enables the calculation of importance scores of various resources unaffected by link directions. The weights of a property representing the mutual interaction between classes are assigned depending on the relative significance of the property to the resource importance of each class. This class-oriented approach is based on the fact that, in the Semantic Web, there are many heterogeneous classes; thus, applying a different appraisal standard for each class is more reasonable. This is similar to the evaluation method of humans, where different items are assigned specific weights, which are then summed up to determine the weighted average. We can check for missing properties more easily with this approach than with other predicate-oriented approaches. A user of a tagging system usually assigns more than one tag to the same resource, and there can be more than one tag with the same subject and object. In the case that many users assign similar tags to the same resource, grading the users differently depending on the assignment order becomes necessary. This idea comes from studies in psychology wherein expertise involves the ability to select the most relevant information for achieving a goal. An expert should be someone who not only has a large collection of documents annotated with a particular tag, but also tends to add documents of high quality to his/her collections. Such documents are identified by the number, as well as the expertise, of users who have the same documents in their collections. In other words, there is a relationship of mutual reinforcement between the expertise of a user and the quality of a document. In addition, there is a need to rank entities related more closely to a certain entity. Considering that the popularity of a topic in social media is temporary, recent data should have more weight than old data.
We propose a comprehensive folksonomy ranking framework in which all these considerations are dealt with and that can be easily customized to each folksonomy site for ranking purposes. To examine the validity of our ranking algorithm and show the mechanism of adjusting property, time, and expertise weights, we first use a dataset designed for analyzing the effect of each ranking factor independently. We then show the ranking results of a real folksonomy site, with the ranking factors combined. Because the ground truth of a given dataset is not known when it comes to ranking, we inject simulated data whose ranking results can be predicted into the real dataset and compare the ranking results of our algorithm with those of a previous HITS-based algorithm. Our semantic ranking algorithm based on the concept of mutual interaction seems to be preferable to the HITS-based algorithm as a flexible folksonomy ranking framework. Some concrete points of difference are as follows. First, with the time concept applied to the property weights, our algorithm shows superior performance in lowering the scores of older data and raising the scores of newer data. Second, applying the time concept to the expertise weights, as well as to the property weights, our algorithm controls the conflicting influence of expertise weights and enhances overall consistency of time-valued ranking. The expertise weights of the previous study can act as an obstacle to the time-valued ranking because the number of followers increases as time goes on. Third, many new properties and classes can be included in our framework. The previous HITS-based algorithm, based on the voting notion, loses ground in the situation where the domain consists of more than two classes, or where other important properties, such as "sent through twitter" or "registered as a friend," are added to the domain. Fourth, there is a big difference in the calculation time and memory use between the two kinds of algorithms.
While the multiplication of two matrices has to be executed twice for the previous HITS-based algorithm, this is unnecessary with our algorithm. In our ranking framework, various folksonomy ranking policies can be expressed with the ranking factors combined, and our approach can work even if the folksonomy site is not implemented with Semantic Web languages. Above all, the time weight proposed in this paper will be applicable to various domains, including social media, where time value is considered important.
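For readers unfamiliar with the baseline that the abstract contrasts itself with, the following is a minimal sketch of the standard HITS iteration (authority and hub scores updated in turn). It is not the authors' mutual-interaction algorithm, and the function name `hits` and the tiny link graph are hypothetical.

```python
def hits(edges, iterations=50):
    """Standard HITS on a directed graph given as (source, target) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of nodes linking in.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub update: sum the authority scores of nodes linked to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Hypothetical link graph: a and b link to c, a also links to d.
hub, auth = hits([("a", "c"), ("b", "c"), ("a", "d")])
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Each iteration performs the two score updates that, in matrix form, correspond to the two multiplications the abstract mentions for the HITS-based approach.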

Approximation of Multiple Trait Effective Daughter Contribution by Dairy Proven Bulls for MACE (젖소 국제유전능력 평가를 위한 종모우별 다형질 Effective Daughter Contribution 추정)

Cho, Kwang-Hyun;Choi, Tae-Jeong;Cho, Chung-Il;Park, Kyung-Do;Do, Kyoung-Tag;Oh, Jae-Don;Lee, Hak-Kyo;Kong, Hong-Sik;Lee, Joon-Ho
- Journal of Animal Science and Technology / v.55 no.5 / pp.399-403 / 2013
This study was conducted to investigate the basic concept of multiple trait effective daughter contribution (MTEDC) for dairy cattle sires and to calculate effective daughter contribution (EDC) by applying a five lactation multiple trait model using milk yield test records of daughters for the Multiple-trait Across Country Evaluation (MACE). Milk yield data and pedigree information of 301,551 cows that were the progeny of 2,046 Korean and imported dairy bulls were collected from the National Agricultural Cooperative Federation and used in this study. For MTEDC approximation, the reliability of the breeding value was separated based on parents average, own yield deviation and mate adjusted progeny contribution. EDC was then calculated by lactation using these reliabilities. The average number of recorded daughters per sire by lactation was 140.57, 94.24, 55.14, 29.20 and 14.06 from the first to fifth lactation, respectively. However, the average EDC per sire by lactation using the five lactation multiple trait model was 113.49, 89.28, 73.56, 54.02 and 35.08 from the first to fifth lactation, respectively; the decrease in EDC in late lactations was comparatively smaller than the decrease in the average number of recorded daughters per sire. These findings indicate that the availability of daughters without late lactation records is increased by genetic correlation using the multiple trait model. Owing to the relatedness between the EDC and the reliability of the estimated breeding value for sires, understanding the MTEDC algorithm and continuous monitoring of EDC are required for correct MACE application of the five lactation multiple trait model.
https://doi.org/10.5187/JAST.2013.55.5.399

Variations in Temperature and Relative Humidity of Rough Rice in the Polypropylene Bulk Bag during Waiting Time for Drying (벌크 백 수확 벼의 건조대기 시간 중 온.습도 변화양상 구명)

Lee, Choon-Ki;Yun, Jong-Tag;Song, Jin;Jeong, Eung-Gi;Lee, Yu-Young;Kim, Wook-Han
- KOREAN JOURNAL OF CROP SCIENCE / v.55 no.4 / pp.339-349 / 2010
The use of polypropylene bulk bags with loading capacities of more than 500 kg as storage containers for rough rice has recently been increasing in Korea. This study was performed to obtain basic information on the changes in temperature and relative humidity in high-moisture rough rice stored in bulk bags while waiting for drying. At paddy moisture contents above 22% on a wet weight basis, the temperature inside the bulk bag rose to more than $40^{\circ}C$ and then declined during storage. For example, in the case of Hwaseongbyeo, rough rice with 26.5% moisture content (MCRR) harvested at 46 days after heading (DAH) showed a peak temperature of $54.5^{\circ}C$ at 66.8 hours after bulk-bag loading, 22.5% MCRR harvested at 52 DAH reached $42.0^{\circ}C$ at 81.1 hours, and 19.7% MCRR harvested at 55 DAH reached $38.9^{\circ}C$ at 119.0 hours. There was a good linear relationship between the peak temperature inside the bulk bag and the moisture content of the paddy ($r^2$=0.89 in 2005, and 0.87 in 2006), while the slope and intercept of the linear regression equation were affected by environmental conditions such as ambient temperature and microbial flora. At paddy moisture contents above 19%, the peak temperature increased at a rate of about $2.74-3.33^{\circ}C$ for every 1% increase in moisture content. The relative humidity varied depending on the temperature inside the bulk bag and the rough rice moisture content, and ranged from 94.2% to 99.9% at the central point of the bulk bag. The results suggest that rapid drying as soon as possible is needed to produce good-quality rice when paddy with a moisture content above 22% on a wet basis is harvested into bulk bags, especially at high ambient temperature.
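The linear relationship between moisture content and peak temperature can be illustrated with an ordinary least-squares fit. The sketch below uses only the three Hwaseongbyeo data points quoted in the abstract; the paper's own regressions were fitted to its full 2005 and 2006 data, so the slope obtained here is an illustration and need not match the reported range. The function name `least_squares` is hypothetical.

```python
def least_squares(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# The three (moisture %, peak temperature in deg C) pairs quoted in the abstract.
moisture = [26.5, 22.5, 19.7]
peak_temp = [54.5, 42.0, 38.9]

slope, intercept = least_squares(moisture, peak_temp)
print(f"peak temperature rises about {slope:.2f} deg C per 1% moisture")
```

With only three points the fit is very sensitive to each observation, which is one reason to treat the result as a sanity check rather than an estimate.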

Identification of Lettuce Germplasms and Commercial Cultivars Using SSR Markers Developed from EST (EST로부터 개발된 SSR 마커를 이용한 상추 유전자원 및 유통품종의 식별)

Hong, Jee-Hwa;Kwon, Yong-Sham;Choi, Keun-Jin;Mishra, Raghvendra Kumar;Kim, Doo Hwan
- Horticultural Science & Technology / v.31 no.6 / pp.772-781 / 2013
The objective of this study was to develop simple sequence repeat (SSR) markers from expressed sequence tags (EST) of lettuce (Lactuca sativa) and to identify 9 germplasms from 3 wild species of lettuce and 61 commercial cultivars using the developed EST-SSR markers. A total of 81,330 lettuce ESTs from NCBI databases were searched for SSRs, and 4,229 SSR loci were identified. Among SSR repeat motifs, trinucleotides represented the highest proportion (59.12%, 2500), followed by dinucleotides (29.70%, 1256) and hexanucleotides (6.62%, 280). A total of 474 EST-SSR primers were developed from the ESTs, and a random set of 267 primers was used to assess the genetic diversity among the 9 germplasms and 61 cultivars. Out of the 267 primers, 47 EST-SSR markers showed polymorphism among 7 cultivars. Twenty-six of these 47 EST-SSR markers showed high polymorphism, reproducibility, and band clarity. The relationship between the genotypes of the 26 markers and the 70 accessions was analyzed. A total of 127 polymorphic amplified fragments were obtained with the 26 EST-SSR markers, and two to nine SSR alleles were detected for each locus, with an average of 4.88 alleles per locus. The average polymorphism information content was 0.542, ranging from 0.269 to 0.768. The genetic distance of clusters ranged from 0.05 to 0.94 among the 70 accessions, and the dendrogram at a similarity of 0.34 gave 7 main clusters. Analysis of the genetic diversity revealed by these 26 EST-SSR markers showed that the 9 germplasms and 61 commercial cultivars were discriminated by marker genotypes. These newly developed EST-SSR markers will be useful for cultivar identification and for distinctness, uniformity and stability testing of lettuce.
https://doi.org/10.7235/hort.2013.13055
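The abstract reports polymorphism information content (PIC) values without stating the formula. A minimal sketch, assuming the commonly used definition PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j² over allele frequencies p_i; the function name `pic` and the allele frequencies below are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pic(allele_freqs):
    """Polymorphism information content for one locus (assumed definition)."""
    homozygosity = sum(p * p for p in allele_freqs)
    # Correction term summed over all unordered pairs of distinct alleles.
    correction = sum(2 * (p * p) * (q * q)
                     for p, q in combinations(allele_freqs, 2))
    return 1.0 - homozygosity - correction


# Hypothetical allele frequencies for a four-allele SSR locus.
freqs = [0.4, 0.3, 0.2, 0.1]
print(round(pic(freqs), 3))

# The reported average of 4.88 alleles per locus follows from 127 fragments
# over 26 markers.
print(round(127 / 26, 2))
```

Under this definition a locus with a single allele has PIC 0, and PIC grows as alleles become more numerous and more evenly distributed.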

A Study on Analysis of Investment Effects of Farm Mechanization, Korea -Mainly on the Case Study of Saemaeul Farm Mechanization Groups in Nonsan Area, Chungnam Province- (농업기계화(農業機械化)의 투자효과분석(投資效果分析)에 관(關)한 연구(硏究) -충남논산지역(忠南論山地域) 새마을 기계화영농단(機械化營農團)을 중심(中心)으로-)

Lim, Jae Hwan;Han, Gwan Soon
- Korean Journal of Agricultural Science / v.14 no.1 / pp.164-185 / 1987
The Korean economy has developed rapidly in the course of implementing the five year economic development plans since 1962. Accordingly, the industrial and employment structure has changed from traditional agriculture to a modern industrial economy. In the course of implementing export oriented industrialization policies, the rural farm economy has encountered labour shortages owing to the drain of the rural farm population to urban areas, rural wage hikes and pressure on farm operation costs, and a possible decrease in farm productivity. To cope with these problems, the Korean government has supplied farm machinery such as power tillers, tractors, transplanters, binders, combines, and dryers by means of favorable credit support and subsidies. The main objectives of this study are to identify the investment effects of farm mechanization, such as B/C and Internal Rate of Return by machinery and operation patterns, changes in labour requirement per 10a for rice culture since 1965, the partial farm budget of rice with and without mechanization, and the estimated labour input with full mechanization. To achieve these objectives, Saemaeul farm mechanization groups, with common ownership and operation, and farms with private ownership and operation were surveyed, mainly in the Nonsan granary area, Chungnam province. The results of this study are as follows.
1. The national average labour input per 10a of paddy decreased from 150.1Hr in 1965 to 87.2Hr in 1985, a 42% decrease in labour input. The labour input in the Nonsan area also decreased from 150.1Hr to 92.8Hr during the same period, a 38% decrease from 1965.
2. The possible labour saving per 10a of paddy was estimated at 60 hours by substituting machine power for labour in plowing, puddling, transplanting, harvesting and threshing, transporting, and drying. The labour savings were derived by deducting the 30 hours of labour input with full mechanization from the 92.8 hours in 1986 in the Nonsan area.
3. Social benefits of farm mechanization were estimated at 124,734won/10a, including increment of rice (10%): 34,064won, labour saving: 65,800won, savings of conventional farm implements: 18,000won, and savings of animal power: 6,870won.
4. Rental charges by work prevailing in the area were 12,000won for land preparation, 15,000won for transplanting with seedlings, 19,500won for combine work, and 6,000won for drying paddy.
5. Farm income per 10a of paddy with and without mechanization amounted to 247,278won and 224,768won, respectively.
6. The social rate of return of the machinery was estimated at more than 50% in all operation patterns. The internal rate of return of the machinery other than tractors was also more than 50%, but the IRR of tractors by operation pattern was 0 to 9%. From the viewpoint of farmers' financial status, private owner-operation of tractors is considered uneconomical. Tractor operation by Saemaeul mechanization groups would be economical considering the government subsidy of 40% of the tractor price.
7. Farmers' recommendations to the government, gained through field operation of farm machinery, are to train rural youth in maintenance technology, to standardize the necessary machinery parts, to implement a price tag system, and to have rural institutions such as RDA, NACF, GUN offices, and FLIA mediate the supply of spare parts and provide marketing information to farmers.
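The figures reported in the results can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the abstract; the variable and function names are illustrative.

```python
# Components of the social benefit per 10a, in won, as listed in the abstract.
benefits = {
    "increment of rice (10%)": 34_064,
    "labour saving": 65_800,
    "savings of conventional farm implements": 18_000,
    "savings of animal power": 6_870,
}
total_benefit = sum(benefits.values())


def percent_decrease(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Labour input per 10a of paddy, in hours.
national_drop = percent_decrease(150.1, 87.2)
nonsan_drop = percent_decrease(150.1, 92.8)

# Farm income per 10a with and without mechanization, in won.
income_gain = 247_278 - 224_768

print(total_benefit, round(national_drop), round(nonsan_drop), income_gain)
```

The component benefits sum to the reported 124,734 won, and the two labour figures correspond to decreases of about 42% and 38%.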

Polymorphisms and Allele Distribution of Novel Indel Markers in Jeju Black Cattle, Hanwoo and Imported Cattle Breeds (제주흑우, 한우 및 수입 소 품종에서 새로운 indel 마커의 다형성과 대립인자 분포)

Han, Sang-Hyun;Kim, Jae-Hwan;Cho, In-Cheol;Cho, Sang-Rae;Cho, Won-Mo;Kim, Sang-Geum;Kim, Yoo-Kyung;Kang, Yong-Jun;Park, Yong-Sang;Kim, Young-Hoon;Park, Se-Phil;Kim, Eun-Young;Lee, Sung-Soo;Ko, Moon-Suck
- Journal of Life Science / v.22 no.12 / pp.1644-1650 / 2012
The aim of this study was to screen the polymorphisms and genotype distributions of insertion/deletion (indel) markers that were found in a preliminary comparative study of bovine genomic sequence databases. Comparative bioinformatic analyses were first performed between the nucleotide sequences of the Bovine Genome Project and those of the expressed sequence tag (EST) database, and a total of fifty-one indel markers were screened. Of these, forty-two indel markers were evaluated, and nine informative indel markers were ultimately selected for population analysis. The nucleotide sequence of each marker was re-sequenced and its polymorphic patterns were typed in six cattle breeds: Holstein, Angus, Charolais, Hereford, and two Korean native cattle breeds (Hanwoo and Jeju Black cattle). The cattle breeds tested in this study showed polymorphic patterns in eight indel markers but not in the Indel-15 marker in Charolais and Holstein. In the Jeju Black cattle (JBC) population, the observed heterozygosity (Ho) was highest in HW_G1 (0.600) and lowest in Indel_29 (0.274). The PIC value was highest in HW_G4 (0.373) and lowest in Indel_6 (0.305). These polymorphic indel markers will be useful in supplying genetic information for parentage tests and traceability and in developing a molecular breeding system for the improvement of animal production in cattle breeds as well as in the JBC population.
https://doi.org/10.5352/JLS.2012.22.12.1644

Sentiment Analysis of Korean Reviews Using CNN: Focusing on Morpheme Embedding (CNN을 적용한 한국어 상품평 감성분석: 형태소 임베딩을 중심으로)

Park, Hyun-jung;Song, Min-chae;Shin, Kyung-shik
- Journal of Intelligence and Information Systems / v.24 no.2 / pp.59-83 / 2018
With the increasing importance of sentiment analysis to grasp the needs of customers and the public, various types of deep learning models have been actively applied to English texts. In the sentiment analysis of English texts by deep learning, natural language sentences included in training and test datasets are usually converted into sequences of word vectors before being entered into the deep learning models. In this case, word vectors generally refer to vector representations of words obtained through splitting a sentence by space characters. There are several ways to derive word vectors, one of which is Word2Vec, used for producing the 300 dimensional Google word vectors from about 100 billion words of Google News data. They have been widely used in studies of sentiment analysis of reviews from various fields such as restaurants, movies, laptops, cameras, etc. Unlike in English, morphemes play an essential role in sentiment analysis and sentence structure analysis in Korean, which is a typical agglutinative language with developed postpositions and endings. A morpheme can be defined as the smallest meaningful unit of a language, and a word consists of one or more morphemes. For example, for the word '예쁘고', the morphemes are '예쁘(= adjective)' and '고(=connective ending)'. Reflecting the significance of Korean morphemes, it seems reasonable to adopt the morpheme as the basic unit in Korean sentiment analysis. Therefore, in this study, we use the 'morpheme vector' as an input to a deep learning model rather than the 'word vector', which is mainly used for English text. The morpheme vector refers to a vector representation of a morpheme and can be derived by applying an existing word vector derivation mechanism to sentences divided into their constituent morphemes. This raises several questions. What is the desirable range of POS (Part-Of-Speech) tags when deriving morpheme vectors for improving the classification accuracy of a deep learning model?
Is it proper to apply a typical word vector model, which primarily relies on the form of words, to Korean, which has a high homonym ratio? Will text preprocessing such as correcting spelling or spacing errors affect the classification accuracy, especially when drawing morpheme vectors from Korean product reviews with a lot of grammatical mistakes and variations? We seek empirical answers to these fundamental issues, which may be encountered first when applying various deep learning models to Korean texts. As a starting point, we summarized these issues as three central research questions. First, which is more effective as the initial input of a deep learning model: morpheme vectors from grammatically correct texts of a domain other than the analysis target, or morpheme vectors from considerably ungrammatical texts of the same domain? Second, what is an appropriate morpheme vector derivation method for Korean with regard to the range of POS tags, homonyms, text preprocessing, and minimum frequency? Third, can we get a satisfactory level of classification accuracy when applying deep learning to Korean sentiment analysis? As an approach to these research questions, we generate various types of morpheme vectors reflecting the research questions and then compare the classification accuracy through a non-static CNN (Convolutional Neural Network) model taking in the morpheme vectors. As training and test datasets, 17,260 cosmetics product reviews from Naver Shopping are used. To derive morpheme vectors, we use data from the same domain as the target and data from another domain: about 2 million cosmetics product reviews from Naver Shopping and 520,000 Naver News articles, arguably corresponding to Google's News data. The six primary sets of morpheme vectors constructed in this study differ in terms of the following three criteria.
First, they come from two types of data source: Naver News, of high grammatical correctness, and Naver Shopping's cosmetics product reviews, of low grammatical correctness. Second, they are distinguished by the degree of data preprocessing, namely, only splitting sentences or additionally applying spelling and spacing corrections after sentence separation. Third, they vary in the form of input fed into the word vector model: whether the morphemes themselves are entered or the morphemes with their POS tags attached. The morpheme vectors further vary depending on the range of POS tags considered, the minimum frequency of morphemes included, and the random initialization range. All morpheme vectors are derived through the CBOW (Continuous Bag-Of-Words) model with a context window of 5 and a vector dimension of 300. It seems that utilizing same-domain text even with a lower degree of grammatical correctness, performing spelling and spacing corrections as well as sentence splitting, and incorporating morphemes of any POS tag, including the incomprehensible category, lead to better classification accuracy. The POS tag attachment, which is devised for the high proportion of homonyms in Korean, and the minimum frequency standard for the morphemes to be included seem not to have any definite influence on the classification accuracy.
https://doi.org/10.13088/jiis.2018.24.2.059
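To make the morpheme-vector input concrete, the sketch below shows how a morpheme sequence, optionally with POS tags attached, would be turned into the (context, target) pairs that a CBOW model trains on, using the context window of 5 reported in the abstract. It does not train any vectors. The function names and the tag labels `VA` and `EC` are assumptions for illustration; the abstract only describes the two morphemes as an adjective and a connective ending.

```python
def attach_pos(morphemes):
    """Join each (form, tag) pair into one token, e.g. '예쁘/VA'."""
    return [f"{form}/{tag}" for form, tag in morphemes]


def cbow_pairs(tokens, window=5):
    """Build (context tokens, target token) pairs for CBOW training."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        # CBOW predicts the target from the surrounding context tokens.
        pairs.append((left + right, target))
    return pairs


# The example word from the abstract, '예쁘고', split into its morphemes.
# The tag names are hypothetical labels for "adjective" and "connective ending".
tagged = attach_pos([("예쁘", "VA"), ("고", "EC")])
for context, target in cbow_pairs(tagged, window=5):
    print(context, "->", target)
```

Attaching the tag to the surface form gives homonyms with different parts of speech distinct tokens, which is the motivation the abstract gives for the POS-attached variant.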
