Korean Word Sense Disambiguation using Dictionary and Corpus (사전과 말뭉치를 이용한 한국어 단어 중의성 해소)
- Journal of Intelligence and Information Systems / v.21 no.1 / pp.1-13 / 2015
As opinion mining in big data applications has attracted attention, a great deal of research on unstructured data has been conducted. Social media on the Internet generate unstructured or semi-structured data every second, much of it written in the natural languages we use in daily life. Many words in human languages have multiple meanings or senses. As a result, it is very difficult for computers to extract useful information from these datasets. Traditional web search engines are usually based on keyword search, which often produces results far from users' intentions. Even though much progress has been made in recent years in improving search engines so that they return appropriate results, there is still considerable room for improvement. Word sense disambiguation can play a very important role in natural language processing and is considered one of the most difficult problems in this area. Major approaches to word sense disambiguation can be classified as knowledge-based, supervised corpus-based, and unsupervised corpus-based. This paper presents a method that automatically generates a corpus for word sense disambiguation from the examples in existing dictionaries, avoiding expensive sense-tagging processes. It evaluates the effectiveness of the method with the Naïve Bayes model, a supervised learning algorithm, using the Korean standard unabridged dictionary and the Sejong Corpus. The Korean standard unabridged dictionary has approximately 57,000 sentences. The Sejong Corpus has about 790,000 sentences tagged with both part-of-speech and senses. In the experiments, the Korean standard unabridged dictionary and the Sejong Corpus were evaluated both in combination and as separate entities using cross validation. Only nouns, the target of word sense disambiguation, were selected.
93,522 word senses among 265,655 nouns and 56,914 sentences from related proverbs and examples were additionally combined into the corpus. The Sejong Corpus was easily merged with the Korean standard unabridged dictionary because the Sejong Corpus was tagged with the sense indices defined by that dictionary. Sense vectors were formed after the merged corpus was created. The terms used in creating the sense vectors were added to the named entity dictionary of a Korean morphological analyzer. Using the extended named entity dictionary, terms were extracted from the input sentences and term vectors for the sentences were created. Given the extracted term vector and the sense vector model built during the pre-processing stage, the sense-tagged terms were determined by word sense disambiguation based on the vector space model. In addition, this study shows the effectiveness of the corpus merged from the examples in the Korean standard unabridged dictionary and the Sejong Corpus. The experiment shows that better precision and recall are obtained with the merged corpus. This study suggests that the method can practically enhance the performance of Internet search engines and help capture the meaning of a sentence more accurately in natural language processing tasks pertinent to search engines, opinion mining, and text mining. The Naïve Bayes classifier used in this study is a supervised learning algorithm based on Bayes' theorem. The Naïve Bayes classifier assumes that all senses are independent. Even though this assumption is not realistic and ignores the correlation between attributes, the Naïve Bayes classifier is widely used because of its simplicity, and in practice it is known to be very effective in many applications such as text classification and medical diagnosis. However, further research needs to be carried out to consider all possible combinations and/or partial combinations of all senses in a sentence.
Also, the effectiveness of word sense disambiguation may be improved if rhetorical structures or morphological dependencies between words are analyzed through syntactic analysis.
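The supervised setup described above, in which example sentences tagged with a sense serve as training data for a Naïve Bayes classifier, can be illustrated with a minimal sketch. This is not the paper's implementation: the training sentences, the sense labels, and the function names `train` and `disambiguate` are all hypothetical, and add-one smoothing is an assumption the abstract does not mention.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (sense label, context words). In the paper's setting
# these would come from dictionary examples and the sense-tagged Sejong Corpus;
# the entries below are invented for illustration only.
TRAINING = [
    ("bank/finance", ["deposit", "money", "account"]),
    ("bank/finance", ["loan", "money", "interest"]),
    ("bank/river", ["river", "water", "fishing"]),
    ("bank/river", ["river", "grass", "walk"]),
]


def train(examples):
    """Count how often each sense occurs and which context words it occurs with."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, words in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab


def disambiguate(context, model):
    """Return the sense maximizing log P(sense) + sum of log P(word | sense)."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        score = math.log(n / total)
        # Add-one (Laplace) smoothing so unseen words do not zero out a sense.
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in context:
            score += math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


model = train(TRAINING)
```

Calling `disambiguate(["money", "loan"], model)` selects the sense whose training examples share the most context words with the input. Each context word contributes independently to the score, which is the simplifying assumption the abstract discusses.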
According to the review and analysis of medical cases that are assigned to the Supreme Court and all local High Court in 2011 and that are presented in the media, it was found that the following categories were taken seriously, medical and pharmaceutical product liability, the third principle of trust between medical institutions, negligence and causation estimation, responsibility limit, the meaning of medical records and related judgment of disturbed substantiation, Oriental doctors' duties to explain the procedures, IMS events, whether one can claim for each medical care operated by non-physician health care institutions to the nonmedical domain in the National Health Insurance Corporation, and the basis of norms for each claim. In the cases related to medical pharmaceutical product liability, Supreme Court alleviated burden of proof for accidents with medical and pharmaceutical products prior to the practice of Product Liability Law and onset the point of negative prescription as the time of damage strikes to condition feasibility of the specific situation. In the cases related to the 3rd principle of trust between medical institutions, the Supreme Court refused to sentence the doctor who has trusted the judgment of the same third-party doctors the violations of the care duty. With respect to proof of a causal relationship and damages in a medical negligence case, the Supreme Court decided that it is unjust to deny negligence by the materials of causal relationship rejecting the original verdict and clarified that the causal relationship shall not deny the reasons to limit doctors' responsibilities. 
So that patients are not disadvantaged when medical records and descriptions of the practice, the most fundamental and important evidence for proving negligence and causation, are neglected, the Supreme Court recognized the hospital's responsibility in a case of neonatal death by suffocation in which the fetal heart rate and uterine contraction monitoring were not properly recorded. In addition, the Seoul Western District Court awarded alimony for the alteration and forgery of medical records. With respect to doctors' duty to explain, the Supreme Court decided that it is necessary to explain the foreseeable risks of combining oriental and western medicines, emphasizing the patient's right of self-determination. However, questions have arisen as to whether this is realistically feasible. In a case of an unlicensed doctor performing intramuscular stimulation treatment (IMS), the Supreme Court deferred deciding whether it was an unlicensed medical practice with respect to the boundary between eastern and western medical practices, but declared that the IMS practice was an acupuncture treatment and that the plaintiff's conduct was therefore an illegal act. In the future, a clear judgment on this matter should be made. With respect to billing claims from non-physician health care institutions, the Supreme Court decided that such claims are void because carrying out the arrangement is contrary to the commitments made in the medical law. In addition, in contrast to private healthcare professionals, who are subject to redemption under the National Healthcare Insurance Law, the Seoul High Court explicitly confirmed that non-professionals who receive operating profit through a tort must return the unjust enrichment and are liable for damages. As mentioned above, a relatively wide range of topics was discussed in the medical field in 2011.
In Korea's health care environment undergoing complex changes day by day, it is expected to see more diverse and in-depth discussions striding out to the development in the field of health care.
The aim of this study was to establish plant regeneration from leaf explants of Sedum tosaense Makino, a globally rare and endangered species. The leaf explants of S. tosaense were cultured on MS medium supplemented with different concentrations of BA and NAA for callus induction. Callus induction was highest (100%) on MS medium containing
The goal of this paper is to investigate changes in North Korea's domestic and foreign policies through automated text analysis of how North Korea is represented in South Korean mass media. Based on that data, we then analyze the status of text mining research, using a text mining technique to find the topics, methods, and trends of text mining research. We also investigate the characteristics and methods of analysis of the text mining techniques, as confirmed by analysis of the data. In this study, the R program was used to apply the text mining technique. R is free software for statistical computing and graphics. Text mining methods make it possible to highlight the most frequently used keywords in a body of text. One can create a word cloud, also referred to as a text cloud or tag cloud. This study proposes a procedure for finding meaningful tendencies based on a combination of word clouds and co-occurrence networks. It aims to explore more objectively the images of North Korea represented in South Korean newspapers by quantitatively reviewing patterns of language use related to North Korea in newspaper big data from November 1, 2016 to May 23, 2019. We divided this span into three periods reflecting recent inter-Korean relations. The period before January 1, 2018 was set as the Before Phase of Peace Building. The period from January 1, 2018 to February 24, 2019 was set as the Peace Building Phase, during which Kim Jong-un's New Year's message and the PyeongChang Olympics created an atmosphere of peace on the Korean peninsula. The third period, after the Hanoi peace summit, was marked by silence in the relationship between North Korea and the United States and was therefore called the Depression Phase of Peace Building. This study analyzes news articles related to North Korea from the Korea Press Foundation database (www.bigkinds.or.kr) through text mining to investigate the characteristics of the Kim Jong-un regime's South Korea policy and unification discourse.
The main results of this study show that trends in the North Korean national policy agenda can be discovered with clustering and visualization algorithms. In particular, the study examines changes in international circumstances, domestic conflicts, living conditions in North Korea, the South's aid projects for the North, conflicts between the two Koreas, the North Korean nuclear issue, and the North Korean refugee problem through co-occurrence word analysis. It also offers an analysis of the South Korean mentality toward North Korea in terms of semantic prosody. In the Before Phase of Peace Building, the analysis yielded, in order, 'Missiles', 'North Korea Nuclear', 'Diplomacy', 'Unification', and 'South-North Korean'. In the Peace Building Phase, the order was 'Panmunjom', 'Unification', 'North Korea Nuclear', 'Diplomacy', and 'Military'. In the Depression Phase of Peace Building, the order was 'North Korea Nuclear', 'North and South Korea', 'Missile', 'State Department', and 'International'. There are 16 words adopted in all three periods. The order is as follows: 'missile', 'North Korea Nuclear', 'Diplomacy', 'Unification', 'North and South Korea', 'Military', 'Kaesong Industrial Complex', 'Defense', 'Sanctions', 'Denuclearization', 'Peace', 'Exchange and Cooperation', and 'South Korea'. We expect that the results of this study will contribute to analyzing trends in news content about North Korea associated with North Korea's provocations. Future research on North Korean trends will be conducted based on the results of this study. We will continue to develop a model for measuring North Korea risk that can anticipate and respond to North Korea's behavior in advance. We expect that text mining and scientific data analysis techniques will be applied in the field of North Korea and unification research.
We hope that these academic efforts lead to many studies that make important contributions to the nation.
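The keyword frequency and co-occurrence analysis described above can be sketched in a few lines. The study itself used R; the sketch below uses Python purely for illustration, and the article keyword lists and the function names `keyword_frequencies` and `cooccurrence_counts` are hypothetical rather than taken from the study.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword lists, one per news article (invented for illustration).
ARTICLES = [
    ["missile", "sanctions", "diplomacy"],
    ["missile", "sanctions"],
    ["peace", "diplomacy", "unification"],
]


def keyword_frequencies(articles):
    """Count the number of articles in which each keyword appears."""
    return Counter(word for words in articles for word in set(words))


def cooccurrence_counts(articles):
    """Count how many articles contain each unordered pair of keywords."""
    pairs = Counter()
    for words in articles:
        # Sorting gives each pair a canonical (alphabetical) order.
        for a, b in combinations(sorted(set(words)), 2):
            pairs[(a, b)] += 1
    return pairs


freq = keyword_frequencies(ARTICLES)
pairs = cooccurrence_counts(ARTICLES)
```

The frequency counts are the kind of input a word cloud is drawn from, and the pair counts can serve as edge weights in a co-occurrence network.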
At the initial stage of Internet advertising, banner advertising came into fashion. As the Internet developed into a central part of daily life and competition in the on-line advertising market grew fierce, there was not enough space for banner advertising, which was concentrated on portal sites. All these factors were responsible for an upsurge in advertising prices. Consequently, the high cost and low efficiency of banner advertising became an issue, which led to the emergence of keyword advertising as a new type of Internet advertising to replace its predecessor. In the early 2000s, when Internet advertising took off, display advertising, including banner advertising, dominated the Net. However, display advertising showed signs of gradual decline and registered negative growth in 2009, whereas keyword advertising grew rapidly and started to outdo display advertising as of 2005. Keyword advertising refers to the advertising technique that displays relevant advertisements at the top of search sites when one searches for a keyword. Instead of exposing advertisements to unspecified individuals as banner advertising does, keyword advertising, a targeted advertising technique, shows advertisements only when customers search for a given keyword, so that only highly prospective customers see them. In this context, it is also referred to as search advertising. It is regarded as more aggressive advertising with a higher hit rate than earlier forms in that, instead of the seller seeking out customers and running advertisements for them as with TV, radio, or banner advertising, it exposes advertisements to customers who are already visiting. Keyword advertising makes it possible for a company to seek publicity on-line simply by using a single word and to achieve maximum efficiency at minimum cost.
The strong point of keyword advertising is that customers can directly reach the products in question through advertising that is more efficient than mass media advertising such as TV and radio. The weak point of keyword advertising is that a company must register its advertisement on each and every portal site and finds it hard to exercise substantial supervision over its advertisement, so that its advertising expenses may exceed its profits. Keyword advertising serves as the most appropriate method of advertising for the sales and publicity of small and medium enterprises, which need maximum advertising effect at low advertising cost. At present, keyword advertising is divided into CPC advertising and CPM advertising. The former is known as the most efficient technique and is also referred to as advertising based on the metered rate system; a company pays for the number of clicks on a keyword that users have searched. This is representatively adopted by Overture, Google's Adwords, Naver's Clickchoice, and Daum's Clicks. CPM advertising depends on a flat rate payment system, in which a company pays for its advertisement on the basis of the number of exposures, not the number of clicks. This method fixes the price of an advertisement per 1,000 exposures and is mainly adopted by Naver's Timechoice, Daum's Speciallink, and Nate's Speedup. At present, the CPC method is most frequently adopted. The weak point of the CPC method is that advertising cost can rise through repeated clicks from the same IP. If a company makes good use of strategies for maximizing the strong points of keyword advertising and complementing its weak points, it is highly likely to turn its visitors into prospective customers.
Accordingly, an advertiser should analyze customers' behavior and approach them in a variety of ways, trying hard to find out what they want. With this in mind, he or she has to use multiple keywords when running ads. When first running an ad, he or she should give priority to deciding which keyword to select. The advertiser should consider how many individuals using a search engine will click the keyword in question and how much he or she has to pay for the advertisement. Because the popular keywords that search engine users frequently use are expensive in terms of unit cost per click, advertisers without much money for advertising in the initial phase should pay attention to detailed keywords suited to their budget. Detailed keywords, also referred to as peripheral keywords or extension keywords, can be described as combinations of major keywords. Most keyword advertisements are in the form of text. The biggest strong point of text-based advertising is that it looks like search results, causing little antipathy. However, it fails to attract much attention precisely because most keyword advertising is in text form. Image-embedded advertising is easy to notice because of its images, but it is displayed on the lower part of a web page and recognized as an advertisement, which leads to a low click-through rate. Its strong point, however, is that its prices are lower than those of text-based advertising. If a company owns a logo or a product that people can easily recognize, it is well advised to make good use of image-embedded advertising so as to attract Internet users' attention. Advertisers should analyze their logs and examine customers' responses to site events and product composition as a means of monitoring their behavior in detail.
Besides, keyword advertising allows them to analyze the advertising effects of exposed keywords through log analysis. Log analysis refers to a close analysis of the current state of a site based on information about its visitors, including the number of visitors, page views, and cookie values. The log files generated by each Web server store a user's IP, the pages used, the time of use, and cookie values. The log files contain a huge amount of data. As it is almost impossible to analyze these log files directly, one is expected to analyze them with log analysis solutions. The generic information that can be extracted from log analysis tools includes total page views, average page views per day, basic page views, page views per visit, total hits, average hits per day, hits per visit, the number of visits, average visits per day, the net number of visitors, average visitors per day, one-time visitors, visitors who have come more than twice, and average usage time. These sites are deemed useful for obtaining data to analyze the situation and current status of rival companies as well as for benchmarking. As keyword advertising exposes advertisements exclusively on search-result pages, competition among advertisers attempting to preoccupy popular keywords is very fierce. Some portal sites keep giving priority to existing advertisers, whereas others give all advertisers a chance to purchase the keywords in question after the advertising contract is over.
If an advertiser tries to rely on keywords sensitive to seasons and timeliness in case of sites providing priority to the established advertisers, he or she may as well make a purchase of a vacant place for advertising lest he or she should miss appropriate timing for advertising. However, Naver doesn't provide priority to the existing advertisers as far as all the keyword advertisements are concerned. In this case, one can preoccupy keywords if he or she enters into a contract after confirming the contract period for advertising. This study is designed to take a look at marketing for keyword advertising and to present effective strategies for keyword advertising marketing. At present, the Korean CPC advertising market is virtually monopolized by Overture. Its strong points are that Overture is based on the CPC charging model and that advertisements are registered on the top of the most representative portal sites in Korea. These advantages serve as the most appropriate medium for small and medium enterprises to use. However, the CPC method of Overture has its weak points, too. That is, the CPC method is not the only perfect advertising model among the search advertisements in the on-line market. So it is absolutely necessary that small and medium enterprises including independent shopping malls should complement the weaknesses of the CPC method and make good use of strategies for maximizing its strengths so as to increase their sales and to create a point of contact with customers.
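The difference between the CPC (pay per click) and CPM (pay per 1,000 exposures) charging models described above can be made concrete with a short calculation. This is a sketch under stated assumptions: the function names and all prices are hypothetical and do not reflect any portal's actual rates.

```python
def cpc_cost(clicks, price_per_click):
    """Cost under CPC: the advertiser pays for each click."""
    return clicks * price_per_click


def cpm_cost(impressions, price_per_thousand):
    """Cost under CPM: the advertiser pays per 1,000 exposures, clicked or not."""
    return impressions / 1000 * price_per_thousand


def break_even_ctr(price_per_click, price_per_thousand):
    """Click-through rate at which CPC and CPM cost the same.

    Setting impressions * ctr * price_per_click equal to
    impressions / 1000 * price_per_thousand and solving for ctr.
    """
    return price_per_thousand / (1000 * price_per_click)


# Hypothetical prices: 200 per click versus 1,000 per thousand exposures.
ctr = break_even_ctr(price_per_click=200, price_per_thousand=1000)
```

With these invented prices, a campaign whose click-through rate is below the break-even value pays less under CPC, and one above it pays less under CPM. Repeated clicks from the same IP, the weakness noted above, raise the effective click count and therefore the CPC cost.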
Flow theory has become one of the most important frameworks in Internet research. Hoffman and Novak proposed a hierarchical flow model showing the antecedents and outcomes of flow and the relationships among these variables in hypermedia computer environments (Hoffman and Novak 1996). This model was further tested after their initial research (Novak, Hoffman, and Yung 2000). In their paper, Hoffman and Novak explained that the balance of challenge and skill leads to flow, which means the positive optimal state of mind (Hoffman and Novak 1996). An imbalance between challenge and skill leads to negative states of mind such as anxiety, boredom, and apathy (Csikszentmihalyi and Csikszentmihalyi 1988). Almost all research on the flow 4-channel model has focused on flow, the positive state of mind (Ellis, Voelkl, and Morris 1994; Mathwick and Rigdon 2004). However, the formation of the negative states of mind and their outcomes also needs to be examined. Flow researchers explain play or playfulness as an antecedent or early state of flow. However, play has been regarded as a concept distinct from flow in the flow literature (Hoffman and Novak 1996; Novak, Hoffman, and Yung 2000). Mathwick and Rigdon discovered the influences of challenge and skill on play; they also observed the influence of play on web loyalty and brand loyalty (Mathwick and Rigdon 2004). Unfortunately, they did not go so far as to test the influence of play on state of mind. This study focuses on the relationships between play and state of mind in the flow 4-channel model. Early research attempted to explain state of mind in flow theory hypothetically, but apart from flow these states have not been tested until now. The importance of play has also been emphasized in flow theory, but it has not been tested in the context of the flow 4-channel model. This researcher attempts to analyze the relationships among challenge, skill, play, state of mind, and web loyalty.
For this objective, I developed a measure for state of mind and defined the concept of play as a trait. Then, the influences of challenge and skill on state of mind and play under on-line shopping conditions were tested. The influences of play on state of mind were also tested, and those of flow and play on web loyalty were highlighted. 294 undergraduate students participated in this survey. They were asked about their perceptions of challenge, skill, state of mind, play, and web loyalty toward an on-line shopping mall. Respondents were restricted to students who had bought products on-line within a month. Those who had bought products at two or more on-line shopping malls were asked to respond about the mall where they had bought the most important product. Construct validity, discriminant validity, and convergent validity were used to validate the measurements. Cronbach's alpha was used to check scale reliability. A series of exploratory factor analyses was conducted. This researcher conducted confirmatory factor analyses to assess the validity of the measurements. All items loaded significantly on their respective constructs, and all reliabilities were greater than .70. Chi-square difference tests and goodness-of-fit tests supported discriminant and convergent validity. The results of clustering and ANOVA showed that high challenge and high skill led to flow, low challenge and high skill led to boredom, and low challenge and low skill led to apathy. Contrary to my expectation, however, high challenge and low skill did not lead to anxiety but to apathy. The results also showed that high challenge with high skill and high challenge with low skill led to the highest play, while low challenge led to low play. Four structural equation models, one each for flow, anxiety, boredom, and apathy, were built to analyze not only the impact of play on state of mind and web loyalty but also that of state of mind on web loyalty.
According to the analysis results of these models, play impacted flow and web loyalty positively but impacted anxiety, boredom, and apathy negatively. Results also showed that flow impacted web loyalty positively, whereas anxiety, boredom, and apathy impacted web loyalty negatively. The interpretations and implications of the hypothesis test results are as follows. First, respondents belonging to different clusters based on challenge and skill level experienced different states of mind such as flow, anxiety, boredom, and apathy. The low challenge and low skill group felt the highest anxiety and apathy. This could be interpreted to mean that this group, feeling high anxiety or fear, avoided attempts to shop on-line. Second, it was found that higher challenge leads to higher levels of play. Test results show that the play level of the high challenge and low skill group (anxiety group) was higher than that of the high challenge and high skill group (flow group). However, this difference was not significant. Third, play positively impacted flow and negatively impacted boredom. The negative impacts on anxiety and apathy were not significant. This means that the combination of challenge and skill creates different results. Fourth, play and flow positively impacted web loyalty, whereas anxiety, boredom, and apathy had negative impacts. The effect of play on web loyalty was stronger in the anxiety, boredom, and apathy groups than in the flow group. These results show that challenge and skill influence state of mind and play. Results also demonstrate how play and flow influence web loyalty. This implies that state of mind and play should be core marketing variables in Internet marketing. Flow theory has focused on flow and on the positive outcomes of flow experiences. However, this research shows that many consumers experience a negative state of mind rather than flow in Internet shopping. Results show that a negative state of mind leads to low or negative web loyalty.
Play can have an important role with the web-loyalty when consumers have the negative state of mind. Results of structural equation model analyses show that play influences web-loyalty positively, even though consumers may be in the negative state of mind. This research found the impacts of challenge and skill on state of mind in the flow 4-channel model, not only flow but also anxiety, boredom, apathy. Also, it highlighted the role of play in the flow 4-channel model context and impacts on web-loyalty. However, tests show a few different results from hypothetical expectations such as the highest anxiety level of apathy group and insignificant impacts of play on anxiety and apathy. Further research needs to replicate this research and/or to compare 3-channel model with 4-channel model.
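The theoretical mapping of the flow 4-channel model, in which combinations of challenge and skill correspond to flow, anxiety, boredom, or apathy, can be sketched as a simple rule. This shows only the hypothesized mapping, not the study's procedure: the study grouped respondents by clustering, and it reports that the high challenge, low skill group empirically showed apathy rather than anxiety. The function name `state_of_mind` and the midpoint of 4.0 (as on a 7-point scale) are assumptions made for illustration.

```python
def state_of_mind(challenge, skill, midpoint=4.0):
    """Map challenge and skill scores to the state predicted by the 4-channel model.

    The midpoint separating "high" from "low" is a hypothetical choice;
    the study derived its groups through clustering instead.
    """
    high_challenge = challenge > midpoint
    high_skill = skill > midpoint
    if high_challenge and high_skill:
        return "flow"
    if high_challenge:
        return "anxiety"  # high challenge, low skill
    if high_skill:
        return "boredom"  # low challenge, high skill
    return "apathy"  # low challenge, low skill


# Hypothetical respondents as (challenge, skill) pairs.
predicted = [state_of_mind(c, s) for c, s in [(6, 6), (6, 2), (2, 6), (2, 2)]]
```

Comparing such predicted labels with measured states of mind is one way to see where the data depart from the model, as the study's anxiety result did.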
The present study is an attempt to solve the basic problems involved in the control of the Sclerotium disease. The biologic strains of Sclerotium rolfsii Sacc., the pathogen of Sclerotium disease of Magnolia kobus, were differentiated, and the effects of vitamins and various nitrogen and carbon sources on its mycelial growth and sclerotial production were investigated. In addition, the relationship between the culture filtrate of Penicillium sp. and the growth of Sclerotium rolfsii, and the tolerance of its mycelia or sclerotia to moist heat or drought and to Benlate (methyl-(butylcarbamoyl)-2-benzimidazole carbamate), Tachigaren (3-hydroxy-5-methylisoxazole), and other chemicals were also clarified. The results are summarized as follows: 1. There were two biologic strains, Type-1 and Type-2, among the isolates. They differed from each other in the mode of growth and colonial appearance on the media, in the aversion phenomenon, and in their pathogenicity. These two types had similar pathogenicity to Magnolia kobus and Robinia pseudoacacia but behaved somewhat differently toward soybean and cucumber, Type-1 being more virulent. 2. Except for potassium nitrite, sodium nitrite, and glycine, all of the 12 nitrogen sources tested were utilized for the mycelial growth and sclerotial production of this fungus when 10 γ/l of thiamine hydrochloride was added to the culture solution. Considering the forms of nitrogen, ammonium nitrogen was more available than nitrate nitrogen for the growth of mycelia, but nitrate nitrogen was better for sclerotia formation. Organic nitrogen showed different availabilities depending on the compounds used, while nitrite nitrogen was unavailable for both mycelial growth and sclerotial formation whether thiamine hydrochloride was added or not. 3. The seven kinds of carbon sources examined were generally not effective as long as thiamine hydrochloride was not added.
When thiamine hydrochloride was added, glucose and saccharose supported mycelial growth, maltose and soluble starch supported less, and xylose, lactose, and glycine showed no effect at all. In sclerotial production, all the tested carbon sources except lactose were effective, and glucose, maltose, saccharose, and soluble starch gave better results. 4. At the same level of nitrogen, the amount of mycelial growth increased as more carbon sources were applied but decreased with the increase of nitrogen above 0.5 g/l. The amount of sclerotial production decreased with the increase of carbon sources. 5. Sclerotium rolfsii was thiamine-deficient and required 20 γ/l of thiamine for maximum growth of mycelia. At concentrations higher than 20 γ/l, however, mycelial growth decreased as the concentration increased, and at 150 γ/l it was inhibited to the level of the thiamine-free condition. 6. The effects of the nitrogen sources on mycelial growth in the presence of thiamine were recognized in the decreasing order of