• Title/Summary/Keyword: Korean text classification

Restoring Omitted Sentence Constituents in Encyclopedia Documents Using Structural SVM (Structural SVM을 이용한 백과사전 문서 내 생략 문장성분 복원)

  • Hwang, Min-Kook;Kim, Youngtae;Ra, Dongyul;Lim, Soojong;Kim, Hyunki
    • Journal of Intelligence and Information Systems / v.21 no.2 / pp.131-150 / 2015
  • Omission of noun phrases in obligatory cases is common in Korean and Japanese sentences but is not observed in English. In encyclopedia texts, an argument of a predicate is omitted even more readily when it can be filled with a noun phrase co-referential with the title. The omitted noun phrase is called a zero anaphor or zero pronoun. Encyclopedias such as Wikipedia are a major source for information extraction by intelligent application systems such as information retrieval and question answering systems. However, omitted noun phrases degrade the quality of information extraction. This paper addresses the problem of developing a system that restores omitted noun phrases in encyclopedia documents. The problem is very similar to zero anaphora resolution, one of the important problems in natural language processing. A noun phrase in the text that can be used for restoration is called an antecedent, and it must be co-referential with the zero anaphor. In zero anaphora resolution the candidate antecedents are only the noun phrases in the same text, whereas in our problem the title is also a candidate. In our system, the first stage detects the zero anaphor. In the second stage, antecedent search is carried out over the candidates. If antecedent search fails, the third stage attempts to use the title as the antecedent. The main characteristic of our system is its use of a structural SVM to find the antecedent. The noun phrases that appear in the text before the position of the zero anaphor comprise the search space. The main technique in previous work is to perform binary classification on every noun phrase in the search space and select the noun phrase classified as an antecedent with the highest confidence.
However, we propose treating antecedent search as the problem of assigning antecedent indicator labels to a sequence of noun phrases. In other words, sequence labeling is employed for antecedent search in the text. We are the first to suggest this idea. To perform sequence labeling, we suggest using a structural SVM that receives a sequence of noun phrases as input and returns a sequence of labels as output. An output label takes one of two values: one indicating that the corresponding noun phrase is the antecedent and the other indicating that it is not. The structural SVM we used is based on the modified Pegasos algorithm, which exploits a subgradient descent methodology used for optimization problems. To train and test our system, we selected a set of Wikipedia texts and constructed an annotated corpus that provides gold-standard answers such as zero anaphors and their possible antecedents. Training examples prepared from the annotated corpus are used to train the SVMs and test the system. For zero anaphor detection, sentences are parsed by a syntactic analyzer and omitted subject or object cases are identified. The performance of our system therefore depends on that of the syntactic analyzer, which is a limitation. When an antecedent is not found in the text, our system tries to use the title to restore the zero anaphor, based on binary classification with a regular SVM. The experiment showed that our system achieves F1 = 68.58%. This means that a state-of-the-art system can be developed with our technique. Future work that enables the system to use semantic information is expected to lead to a significant performance improvement.
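
The abstract names Pegasos, a subgradient descent method for training SVMs. As a rough illustration only, the sketch below applies Pegasos-style updates to a plain binary linear SVM without a bias term. It is not the structural variant or the modified algorithm the paper uses, and the function names, feature vectors and labels are hypothetical.

```python
import random

def pegasos_train(examples, lam=0.1, steps=200, seed=0):
    """Train a linear SVM (no bias term) with Pegasos-style subgradient steps."""
    rng = random.Random(seed)
    dim = len(examples[0][0])
    w = [0.0] * dim
    for t in range(1, steps + 1):
        x, y = rng.choice(examples)      # sample one training example
        eta = 1.0 / (lam * t)            # step size decays as 1 / (lambda * t)
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        # Regularization part of the subgradient: shrink w.
        w = [(1.0 - eta * lam) * wi for wi in w]
        if margin < 1.0:
            # Hinge loss is active: move w toward the violated example.
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def predict(w, x):
    """Return +1 or -1 depending on the sign of the linear score."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Hypothetical toy data: (feature vector, label); +1 = antecedent, -1 = not.
TOY = [((2.0, 2.0), 1), ((3.0, 1.0), 1), ((2.0, 3.0), 1),
       ((-2.0, -2.0), -1), ((-3.0, -1.0), -1), ((-1.0, -3.0), -1)]

weights = pegasos_train(TOY)
predictions = [predict(weights, x) for x, _ in TOY]
```

In a structural setting the per-example hinge test would presumably be replaced by a search for the most violating label sequence, which this sketch does not attempt.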

Interministerial GHS Activities and Implementation in Korea

  • Yu, Il-Je
    • Proceedings of the Korean Environmental Health Society Conference / 2005.06a / pp.240-248 / 2005
  • To implement a globally harmonized system of classification and labeling of chemicals (GHS) in Korea, an interministerial GHS working group involving 6 ministries established an expert working group, composed of 7 experts from relevant organizations and one private consultant, to prepare an official Korean GHS version by March 2005. The translation and review of the official Korean GHS version, including annexes, started in October 2004 and was completed on March 15, 2005. The official Korean GHS version has now been posted on the websites of the relevant ministries and organizations to solicit public opinion, and it will be finalized after a public hearing scheduled for May 2005. Collaborative efforts to implement and disseminate the GHS in Korea will continue, to avoid confusion or duplication and to use resources effectively. The GHS was originally adopted in 1992 at the United Nations Conference on Environment and Development (UNCED), as subsequently reflected in Agenda 21, chapter 19. The work was coordinated and managed under the auspices of the Interorganization Programme for the Sound Management of Chemicals (IOMC) Coordinating Group for the Harmonization of Chemical Classification Systems (UNCEGHS). The technical focal points for completing the work were the International Labour Organization (ILO); the Organization for Economic Cooperation and Development (OECD); and the United Nations Economic and Social Council's Subcommittee of Experts on the Transport of Dangerous Goods (UNSCETDG). The work was finalized in October 2002, and the World Summit on Sustainable Development in Johannesburg on 4 September 2002 encouraged countries to implement the new GHS as soon as possible with a view to having the system fully operational by 2008 (UN, 2003).
Implementation has already started, with pilot countries in different regions of the world introducing the system into their national practices. The GHS text, called the purple book, became available as a UN publication in early 2003. The GHS will be kept dynamic, regularly revised, and made more efficient as experience is gained in its implementation. While national or regional governments are the primary audiences for this document, it also contains sufficient context and guidance for those in industry who will ultimately implement the national requirements that will be introduced (UN, 2003). The Japanese government published its official Japanese GHS version, the first in Asia, in April 2004, after starting work in January 2003 through an interministerial chemical coordination committee involving 7 ministries: the Ministry of Foreign Affairs; Ministry of Internal Affairs and Communications; Ministry of Health, Labour, and Welfare; Ministry of Agriculture, Forestry and Fisheries; Ministry of Economy, Trade and Industry; Ministry of Land, Infrastructure, and Transport; and Ministry of Environment (MOE, 2004). Accordingly, similar to the Japanese GHS efforts, this paper presents the interministerial efforts involved in publishing the official Korean GHS version.

Studies on Differential Therapeutic Principle of Three Yang and Three Yin through Analysis of Pathological Transmission (<상한론(傷寒論)>의 병리전변분석을 통한 중경(仲景)의 삼음삼양(三陰三陽) 증치원리(證治原理) 연구)

  • Chi, Gyoo Yong
    • Journal of Physiology & Pathology in Korean Medicine / v.28 no.4 / pp.365-370 / 2014
  • The intrinsic concepts of the three yin and three yang diseases in Shanghanlun (傷寒論) remain unclear despite considerable controversy. To address this, the structures of pathological transmission and the anatomical terms used in the text were analyzed first. On these structural bases, the theoretical background and differential therapeutic principles of the three yin and three yang disease classification were then examined. The organic structures frequently used in the text were the heart, stomach, pancreas, blood chamber and urinary bladder, while the important regions in the transmission were the chest, flank, epigastrium, abdomen, hypogastrium and groin. When a host is invaded by an extrinsic pathogen, an affinity forms between the two based on the similarity of epidermal density condition and nutrient-defense features and on existing disorders in the body. The symptoms then appear in 3 stages with 6 patterns in general infective diseases. The initial stage is the period in which the syndrome is limited to the external flesh area; it mainly corresponds to taiyang bing, along with the other exterior patterns of 3 yang and 2 yin bing. The middle stage runs from the end of the initial stage to the climax and corresponds mainly to yangming bing, including shaoyang and taiyin bing. In the terminal stage, the host gradually falls into an exhaustive step or a recovery phase, corresponding to shaoyin and jueyin bing. In conclusion, these dual meanings of three yang and yin should be a first guide and principle of treatment against various infective diseases.

Development of An Instructional material for High School Environmental Education Emphasizing Affective Objectives (정의적 영역 중심의 고등학교 환경 교재 개발)

  • 박진희;장남기
    • Hwankyungkyoyuk / v.6 no.1 / pp.63-99 / 1994
  • International environmental activities and environmental education began in the 1970s. Environmental education in Korea has been emphasized since the Fourth National Curriculum. 'The Environmental Education Curriculum' will be separated out as one of the most important parts of the Sixth National Education Curriculum in Korea. The purpose of this study was to develop a high school 'Environmental Science' text appropriate to the Sixth National Education Curriculum. The first step was to state the goals of environmental education in detail, based on an analysis of environmental education goals in Korea and other countries. The second was to analyze seven environment-related texts from Korea, America and England. The third was to measure how much environmental education had achieved under the Fifth National Curriculum of Korea. The fourth was to develop a new high school level environmental text. The fifth was to verify the effect of the developed text. The environmental part of 'Science I' (unit V. Life and Environments) and high school environments-related reference text (Survival and Environments) in Korea, American knowledges. The American 'Environments' stressed many skills but did not include various teaching strategies. On the other hand, the American 'Science-Technology-Society (S-T-S)' and the British 'Science and Technology in Society (SATIS)' stressed knowledge and skills, and they included many teaching strategies and student actions. The American 'S-T-S' was the only one that stressed values and attitudes. None of the seven texts addressed behaviors and participation. To measure the achievement of environmental education by questionnaire, a total of 497 high school students were selected from five different schools. Most students held positive thoughts and attitudes about environmental problems, but many of them did not take action to solve environmental problems or protect the environment.
The higher the score students got in 'knowledge and information', the higher their score in 'skill'. This implies that learning skills is based on learning knowledge and information about the environment. On the other hand, extensive knowledge and information about the environment did not always ensure positive thoughts and attitudes, or active behaviors and participation, toward solving environmental problems. Given that the ultimate aim of environmental education is forming responsible environmental behaviors, the goals of values and behaviors are as important as knowledge and skills. A new high school level environmental text was developed, based on the analysis of the seven texts and of environmental education in the Fifth Korean Curriculum. This text has seven units: 1. Habitats : What're the meanings?, 2. Nuclear Energy : Can't be Avoid?, 3. Acid Rain : What're the Messages?, 4. Ethanol : Is this Future Fuel?, 5. Wastes : A New War!, 6. What're the National and Global Environmental education and avoided from the array of knowledges. It therefore included various teaching strategies and independent student actions. 'Open-ended value learning' and 'free behavior learning' in the text were special learning parts for the acquisition of values and the formation of behaviors. To verify the effects of the newly developed environmental text, direct learning was carried out with a total of 286 students. Post-test scores of the experimental groups for each unit were significantly higher than those of the control groups. The responses collected from five different schools were as follows. For the validity of the content selected for the units, 74% of respondents replied positively. For the classification and presentation of the four goal-groups, 90% replied positively on validity and 82% on utility.
For the validity of the various teaching strategies, 88% replied positively, and for the degree of inclusion of student-centered independent actions, 86%. For the importance and expected effects of 'open-ended value learning' and 'free behavior learning', 88% and 92% respectively responded positively. Therefore this text is effective in achieving the four goals of environmental education equally.

A Study of 'Emotion Trigger' by Text Mining Techniques (텍스트 마이닝을 이용한 감정 유발 요인 'Emotion Trigger'에 관한 연구)

  • An, Juyoung;Bae, Junghwan;Han, Namgi;Song, Min
    • Journal of Intelligence and Information Systems / v.21 no.2 / pp.69-92 / 2015
  • The explosion of social media data has led researchers to apply text-mining techniques to analyze big social media data more rigorously. Although social media text analysis algorithms have improved, previous approaches still have limitations. In sentiment analysis of social media written in Korean, there are two typical approaches. One is the linguistic approach using machine learning, which is the most common; some studies have added grammatical factors to the feature sets used to train the classification model. The other applies semantic analysis to sentiment analysis, but it has mainly been applied to English texts. To overcome these limitations, this study applies the Word2Vec algorithm, an extension of neural network algorithms, to handle more extensive semantic features that were underestimated in existing sentiment analysis. The result of applying Word2Vec is compared with the result of co-occurrence analysis to identify the difference between the two approaches. The results show that Word2Vec extracts about three times as many related words expressing some emotion about the keyword as co-occurrence analysis does. The difference comes from Word2Vec's vectorization of semantic features. It is therefore possible to say that the Word2Vec algorithm can catch hidden related words that have not been found in traditional analysis. In addition, part-of-speech (POS) tagging for Korean is used to detect adjectives as "emotional words". The emotion words extracted from the text are then converted into word vectors by the Word2Vec algorithm to find related words. Among these related words, nouns are selected because each of them may have a causal relationship with the "emotional word" in the sentence.
The process of extracting these trigger factors of an emotional word is named "Emotion Trigger" in this study. As a case study, the datasets were collected by searching with three keywords (professor, prosecutor, and doctor) because these keywords carry rich public emotion and opinion. Preliminary data collection was conducted to select secondary keywords for data gathering. The secondary keywords used for each keyword to gather the data for the actual analysis are as follows: Professor (sexual assault, misappropriation of research money, recruitment irregularities, polifessor), Doctor (Shin hae-chul sky hospital, drinking and plastic surgery, rebate), Prosecutor (lewd behavior, sponsor). The size of the text data is about 100,000 (Professor: 25720, Doctor: 35110, Prosecutor: 43225), and the data were gathered from news, blogs, and Twitter to reflect various levels of public emotion in the text data analysis. Gephi (http://gephi.github.io) was used for visualization, and all programs used for text processing and analysis were written in Java. The contributions of this study are as follows. First, different approaches to sentiment analysis are integrated to overcome the limitations of existing approaches. Second, finding the Emotion Trigger can detect hidden connections to public emotion that existing methods cannot detect. Finally, the approach used in this study could be generalized regardless of the type of text data. The limitation of this study is that it is hard to say that a word extracted by Emotion Trigger processing has a significant causal relationship with the emotional word in a sentence. Future work will clarify the causal relationship between emotional words and the words extracted by Emotion Trigger by comparing them with manually tagged relationships. Furthermore, the text data used in Emotion Trigger come from Twitter, so the data have a number of distinct features that we did not deal with in this study.
These features will be considered in further study.
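
The extraction step described above (take an emotion word, find related words in vector space, keep only the nouns) can be sketched roughly as follows. This is a minimal illustration, not the authors' Java implementation: the function names, the tiny 3-dimensional vectors, the English toy words and the POS tags are all hypothetical stand-ins for trained Word2Vec vectors and a Korean POS tagger.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def emotion_triggers(emotion_word, vectors, pos_tags, top_n=2):
    """Return the top_n nouns most similar to emotion_word by cosine similarity."""
    target = vectors[emotion_word]
    scored = [(cosine(target, vec), word)
              for word, vec in vectors.items()
              if word != emotion_word and pos_tags.get(word) == "NOUN"]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]

# Hypothetical word vectors and POS tags, for illustration only.
VECTORS = {
    "angry": (0.9, 0.1, 0.0),
    "scandal": (0.8, 0.2, 0.1),
    "bribe": (0.7, 0.1, 0.3),
    "weather": (0.0, 0.1, 0.9),
    "furious": (0.9, 0.2, 0.0),
}
POS = {"angry": "ADJ", "scandal": "NOUN", "bribe": "NOUN",
       "weather": "NOUN", "furious": "ADJ"}

triggers = emotion_triggers("angry", VECTORS, POS)
```

Here "furious" is the closest word to "angry" but is dropped because it is an adjective, which mirrors the noun filter in the abstract.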

Impact of Word Embedding Methods on Performance of Sentiment Analysis with Machine Learning Techniques

  • Park, Hoyeon;Kim, Kyoung-jae
    • Journal of the Korea Society of Computer and Information / v.25 no.8 / pp.181-188 / 2020
  • In this study, we conduct a comparative study to confirm the impact of various word embedding techniques on the performance of sentiment analysis. Sentiment analysis is an opinion mining technique that identifies and extracts subjective information from text using natural language processing, and it can be used to classify the sentiment of product reviews or comments. Since sentiment can be classified as either positive or negative, it can be considered a general classification problem. For sentiment analysis, the text must be converted into a form that a computer can process. Therefore, text such as a word or document is transformed into a vector, a process called word embedding in natural language processing. Various techniques, such as Bag of Words, TF-IDF, and Word2Vec, are used as word embedding techniques. Until now, there have not been many studies on which word embedding techniques are suitable for sentiment analysis. In this study, Bag of Words, TF-IDF, and Word2Vec are used to compare and analyze the performance of movie review sentiment analysis. The research data set is the IMDB data set, which is widely used in text mining. As a result, TF-IDF and Bag of Words outperformed Word2Vec, and TF-IDF performed better than Bag of Words, but the difference was not very significant.
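
To make the difference between two of the compared representations concrete, here is a small sketch that builds Bag of Words count vectors and then reweights them with TF-IDF. It assumes whitespace tokenization and the plain idf = log(N / df) variant; the study's actual preprocessing and weighting are not stated in the abstract, and the three reviews are made up.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Return the sorted vocabulary and one raw term-count vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    counts = [Counter(doc.split()) for doc in docs]
    return vocab, [[c[term] for term in vocab] for c in counts]

def tf_idf(docs):
    """Weight each count by idf = log(N / df), one common TF-IDF variant."""
    vocab, bow = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return vocab, [[row[j] * idf[j] for j in range(len(vocab))] for row in bow]

# Hypothetical mini-corpus of reviews.
REVIEWS = ["great great movie", "boring movie", "great acting"]
vocab, bow = bag_of_words(REVIEWS)
_, weighted = tf_idf(REVIEWS)
```

In the second review, "boring" occurs in one document and "movie" in two, so TF-IDF gives "boring" the larger weight even though both have the same raw count.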

Semantic analysis via application of deep learning using Naver movie review data (네이버 영화 리뷰 데이터를 이용한 의미 분석(semantic analysis))

  • Kim, Sojin;Song, Jongwoo
    • The Korean Journal of Applied Statistics / v.35 no.1 / pp.19-33 / 2022
  • With the explosive growth of social media, the abundant text-based data generated by web users has become an important source for data analysis. For example, online movie reviews on 'Naver Movie' often influence the general public's decision on whether to watch a movie. This study analyzed Naver Movie's text-based review data to predict the actual ratings. After examining the distribution of movie ratings, we performed semantic analysis using Korean natural language processing. This research sought to find the best review rating prediction model by comparing machine learning and deep learning models. We also compared various regression and classification models in 2-class and multi-class cases. Lastly, we explained the causes of review misclassification related to the characteristics of movie review data.

A Study on Automatic Classification Model of Documents Based on Korean Standard Industrial Classification (한국표준산업분류를 기준으로 한 문서의 자동 분류 모델에 관한 연구)

  • Lee, Jae-Seong;Jun, Seung-Pyo;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems / v.24 no.3 / pp.221-241 / 2018
  • As we enter the knowledge society, the importance of information as a new form of capital is being emphasized. Information classification is also becoming more important for the efficient management of exponentially growing digital information. In this study, we tried to automatically classify and provide tailored information that can help companies make decisions on technology commercialization. We therefore propose a method to classify information based on the Korea Standard Industry Classification (KSIC), which indicates the business characteristics of enterprises. The classification of information or documents has largely been based on machine learning, but there is not enough training data categorized by KSIC. This study therefore applied a method of calculating similarity between documents. Specifically, we propose a method and a model that present the most appropriate KSIC code by collecting the explanatory texts of each KSIC code and calculating their similarity to the document to be classified using the vector space model. IPC data were collected and classified by KSIC, and the methodology was then verified by comparing the results with the KSIC-IPC concordance table provided by the Korean Intellectual Property Office. The highest agreement was obtained when the LT method, a kind of TF-IDF calculation formula, was applied. In that case, the first-ranked KSIC code matched in 53% of cases and the cumulative match through the fifth rank was 76%. This confirms that the technology, industry, and market information needed by SMEs can be classified by KSIC more quantitatively and objectively. In addition, the methods and results of this study can serve as basic data to support the qualitative judgment of experts in creating a linkage table between heterogeneous classification systems.
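
The similarity-based approach and its rank-based evaluation can be illustrated with a short sketch: each code's explanatory text and the target document become term vectors, codes are ranked by cosine similarity, and agreement is measured as the share of documents whose reference code appears within the top k ranks. The log-scaled term frequency used here is one common weighting and is only assumed to resemble the LT method (the IDF factor is omitted for brevity). The codes, descriptions, documents and reference labels are hypothetical, not real KSIC or IPC data.

```python
import math
from collections import Counter

def log_tf_vector(text):
    """Sparse vector with log-scaled term frequency: 1 + log(count)."""
    return {t: 1.0 + math.log(c) for t, c in Counter(text.split()).items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_codes(document, code_descriptions):
    """Rank codes by similarity of their description to the document."""
    doc_vec = log_tf_vector(document)
    scores = {code: cosine(doc_vec, log_tf_vector(desc))
              for code, desc in code_descriptions.items()}
    # Highest score first; ties are broken alphabetically by code.
    return sorted(scores, key=lambda c: (-scores[c], c))

def match_at_k(rankings, gold, k):
    """Fraction of documents whose reference code appears in the top k ranks."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)

# Hypothetical codes and descriptions; not real KSIC entries.
CODES = {
    "A": "crop farming and animal farming",
    "B": "software development and programming",
    "C": "software publishing and data hosting",
}
DOCS = ["programming software for data analysis", "animal feed for farming"]
GOLD = ["C", "A"]

rankings = [rank_codes(d, CODES) for d in DOCS]
top1 = match_at_k(rankings, GOLD, 1)
top2 = match_at_k(rankings, GOLD, 2)
```

On this toy data the first document's reference code is only ranked second, so the match rate rises from 0.5 at rank 1 to 1.0 at rank 2, the same kind of cumulative figure as the 53% and 76% reported in the abstract.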

An Experimental Study on the Performance Improvement of Automatic Classification for the Articles of Korean Journals Based on Controlled Keywords in International Database (해외 데이터베이스의 통제키워드에 기초한 국내 학술지 논문의 자동분류 성능 향상에 관한 실험적 연구)

  • Kim, Pan Jun;Lee, Jae Yun
    • Journal of the Korean Society for Library and Information Science / v.48 no.3 / pp.491-510 / 2014
  • Keywords, a major factor in the efficient management and retrieval of articles in databases, are classified into uncontrolled keywords and controlled keywords. Most Korean scholarly databases fail to provide controlled vocabularies for indexing research articles, which help users retrieve relevant papers exhaustively. In this paper, we carried out automatic descriptor assignment experiments on Korean articles using automatic classifiers trained with descriptors from an international database. The results show that classifiers trained with descriptors from an international database can potentially offer controlled vocabularies for Korean scholarly articles that have English abstracts. We also sought to improve the performance of automatic descriptor assignment by using various classifiers and combinations of them.

A Study on the Social Media Sharing Intention by Exhibition Visitors -Focused on D Museum Plastic-Fantastic and Instagram- (전시방문객의 소셜미디어 공유의도에 관한 연구 -디뮤지엄의 Plastic Fantastic과 Instagram을 중심으로-)

  • Kim, Chaeeun;Lee, Joonhan;Kim, Sun Mee
    • Journal of Fashion Business / v.22 no.4 / pp.20-29 / 2018
  • Today, visitors to art galleries prefer sharing their lives with their communities to interacting with the artwork. Meanwhile, sharing images of an exhibition on social media has become more important than actually viewing the artwork. Accordingly, most galleries have started paying more attention to organizing an exhibition environment suited to proof-shots in order to attract more visitors. First, we researched the internet environment from the late 1990s to recent years and analyzed how exhibition viewing patterns have changed since the advent of social media. Second, for empirical case analysis, we selected 'Plastic Fantastic', held at D-Museum, as the target of analysis. The analysis covered 500 recent postings found on Instagram on March 4, 2018 under 'Plastic-Fantastic' (in Korean). The analysis classified the types of image, hashtag, and text on Instagram and arranged them in order of relation to the exhibits. In the image analysis, 44.2% of the images involved exhibition displays; the others included a person or other goods. In the text and hashtag analysis, only 3.6% of postings included information about the exhibition, and 56.4% had only an image with unrelated inflow hashtags. This sharing behavior is likely to gradually erode the inherent meaning and value of the exhibition rather than impart the artistic thrill that viewers derive from art. Exhibitions should seek deep interaction between the display, audience, and social media users, rather than encouraging visitors to take proof-shots.