• Title/Summary/Keyword: Korean text classification

Sentiment Analysis of Korean Reviews Using CNN: Focusing on Morpheme Embedding (CNN을 적용한 한국어 상품평 감성분석: 형태소 임베딩을 중심으로)

  • Park, Hyun-jung;Song, Min-chae;Shin, Kyung-shik
    • Journal of Intelligence and Information Systems / v.24 no.2 / pp.59-83 / 2018
  • As sentiment analysis becomes more important for understanding the needs of customers and the public, various deep learning models have been actively applied to English texts. In deep-learning-based sentiment analysis of English texts, the natural language sentences in the training and test datasets are usually converted into sequences of word vectors before being fed into the model. Here, word vectors generally refer to vector representations of the words obtained by splitting a sentence on space characters. There are several ways to derive word vectors; one is Word2Vec, which was used to produce the 300-dimensional Google word vectors from about 100 billion words of Google News data. These vectors have been widely used in sentiment analysis studies of reviews from fields such as restaurants, movies, laptops, and cameras. Unlike in English, the morpheme plays an essential role in sentiment analysis and sentence structure analysis in Korean, a typical agglutinative language with well-developed postpositions and endings. A morpheme can be defined as the smallest meaningful unit of a language, and a word consists of one or more morphemes. For example, the word '예쁘고' consists of the morphemes '예쁘' (adjective) and '고' (connective ending). Given the significance of Korean morphemes, it seems reasonable to adopt the morpheme as the basic unit in Korean sentiment analysis. Therefore, in this study, we use 'morpheme vectors' as the input to a deep learning model rather than the 'word vectors' mainly used for English text. A morpheme vector is a vector representation of a morpheme and can be derived by applying an existing word vector derivation mechanism to sentences divided into their constituent morphemes. This raises several questions. What is the desirable range of POS (part-of-speech) tags when deriving morpheme vectors to improve the classification accuracy of a deep learning model?
Is it appropriate to apply a typical word vector model, which relies primarily on the form of words, to Korean, which has a high ratio of homonyms? Will text preprocessing such as correcting spelling or spacing errors affect classification accuracy, especially when deriving morpheme vectors from Korean product reviews with many grammatical mistakes and variations? We seek empirical answers to these fundamental issues, which are likely to be encountered first when applying deep learning models to Korean texts. As a starting point, we summarize these issues in three central research questions. First, which is more effective as the initial input of a deep learning model: morpheme vectors from grammatically correct texts of a domain other than the analysis target, or morpheme vectors from considerably ungrammatical texts of the same domain? Second, what is an appropriate morpheme vector derivation method for Korean with regard to the range of POS tags, homonyms, text preprocessing, and minimum frequency? Third, can a satisfactory level of classification accuracy be achieved when applying deep learning to Korean sentiment analysis? To address these questions, we generate various types of morpheme vectors reflecting the research questions and then compare classification accuracy using a non-static CNN (Convolutional Neural Network) model that takes the morpheme vectors as input. For the training and test datasets, 17,260 cosmetics product reviews from Naver Shopping are used. To derive morpheme vectors, we use data from the same domain as the target and data from another domain: about 2 million cosmetics product reviews from Naver Shopping and 520,000 Naver News data, the latter arguably corresponding to Google's News data. The six primary sets of morpheme vectors constructed in this study differ in terms of the following three criteria.
First, they come from two types of data source: Naver News, with high grammatical correctness, and Naver Shopping's cosmetics product reviews, with low grammatical correctness. Second, they differ in the degree of data preprocessing: sentence splitting only, or sentence splitting followed by additional spelling and spacing corrections. Third, they vary in the form of input fed into the word vector model: the morphemes alone, or the morphemes with their POS tags attached. The morpheme vectors further vary in the range of POS tags considered, the minimum frequency of the morphemes included, and the random initialization range. All morpheme vectors are derived with the CBOW (Continuous Bag-Of-Words) model with a context window of 5 and a vector dimension of 300. The results suggest that classification accuracy improves when using text from the same domain even if it is less grammatically correct, when performing spelling and spacing corrections in addition to sentence splitting, and when incorporating morphemes of all POS tags, including the incomprehensible category. POS tag attachment, which was devised to handle the high proportion of homonyms in Korean, and the minimum frequency threshold for including a morpheme do not seem to have any definite influence on classification accuracy.
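The input variants described in this abstract (morphemes alone or with POS tags attached, a restricted range of POS tags, and a minimum frequency) can be illustrated with a minimal sketch. It assumes the sentences have already been split into (morpheme, tag) pairs by some morphological analyzer; the function name `build_tokens`, the toy sentences, and the Sejong-style tag names are illustrative assumptions, not the authors' code. The resulting token lists could then be passed to any Word2Vec implementation configured for CBOW with window 5 and dimension 300.

```python
from collections import Counter

def build_tokens(tagged_sentences, keep_tags=None, attach_pos=False, min_count=1):
    """Turn POS-tagged morphemes into token lists for a word-vector model.

    keep_tags : set of POS tags to keep, or None to keep every tag
    attach_pos: if True, emit "morpheme/TAG" so homonyms with different tags differ
    min_count : drop tokens that occur fewer than this many times in the corpus
    """
    sentences = []
    for sent in tagged_sentences:
        tokens = []
        for morph, tag in sent:
            if keep_tags is not None and tag not in keep_tags:
                continue
            tokens.append(f"{morph}/{tag}" if attach_pos else morph)
        sentences.append(tokens)
    # Count over the whole corpus, then apply the minimum-frequency filter.
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok for tok in sent if counts[tok] >= min_count] for sent in sentences]

# Hand-tagged toy corpus (hypothetical; tags follow a Sejong-style tag set).
tagged = [
    [("예쁘", "VA"), ("고", "EC"), ("좋", "VA"), ("아요", "EF")],
    [("배송", "NNG"), ("이", "JKS"), ("빠르", "VA"), ("고", "EC"), ("좋", "VA"), ("아요", "EF")],
]

plain = build_tokens(tagged)                        # morphemes only
with_pos = build_tokens(tagged, attach_pos=True)    # morphemes with POS tags attached
adjectives = build_tokens(tagged, keep_tags={"VA"}) # restricted POS range
frequent = build_tokens(tagged, min_count=2)        # minimum frequency of 2
```

Keeping the filtering separate from vector training makes it easy to generate each variant from the same tagged corpus and compare them under identical training settings.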

Study of Toxicity Presence Classification about Herbal Diet in Tang-aec-pyeon of Dong-ui-bo-gam (동의보감 탕액편에 기재된 식이본초의 독성유무에 대한 분류 연구)

  • Shin, Ho-Dong;Jeong, Jong-Un
    • The Journal of Korean Medicine / v.32 no.1 / pp.12-35 / 2011
  • Objectives: Two criteria for clarifying the toxicity of a herbal diet are well known. Although mechanical analysis of effective ingredients, a Western approach, is widely used, toxicity classification through herbal analysis from the viewpoint of the theory of herbal medicine properties has been disregarded. This study aims to support the safe use of herbal diets by classifying and studying the presence of toxicity in them from the viewpoint of the theory of herbal medicine properties, one of the methods of Oriental Medicine. Methods: We classified and studied the presence of toxicity in four kinds of herbal diets (waters and grains, animal groups, fruits and vegetables, and herbs and trees), excluding mineral natural drugs, among the 1,400 kinds of medicines in the 16 chapters of Tang-aec-pyeon of Dong-ui-bo-gam, using herbal analysis from the viewpoint of the theory of herbal medicine properties. Herbal diets were broadly classified as toxicant or non-toxicant, and the toxicant ones were in turn classified as insignificant, medium, or significant. The category of herbal diet was limited to items used as both food and natural drugs. The main text is Dong-ui-bo-gam, although various other references were also used. Results: The toxicant diets are: one kind of tortoise meat in the animal groups; five kinds in the grains part of fruits and vegetables: aengdo, peach, oyat, small apple and gingko nut; and 12 kinds in the vegetables part of fruits and vegetables: ginger, oriental cabbage, lettuce, chongbaek, onion, garlic, leek, fern, houttuynia cordata (myeol), pyeongji, geundae, and spinach. These should be prohibited from long-term use both as food and as medicine. Conclusion: If a herbal diet is used as a health food supplement or as food, the presence of toxicity should be considered on the basis of the Oriental Medicine theory of herbal medicine properties.

The Development of the Model of Information Structure for Photo Archives in University Archives (대학기록관 사진 아카이브를 위한 정보구조 모형 제안)

  • Hyewon Lee;Seunghee Han
    • Journal of Korean Society of Archives and Records Management / v.23 no.1 / pp.101-126 / 2023
  • Photographic archives of universities are among the most valuable types of records that establish a university's identity and provide historical evidence. Unlike text records, however, they are weak at conveying meaning. It is therefore difficult to support users' search and use unless the information in photo records is comprehensively described. In this study, we sought to structure a classification system for university photo archives and to develop a metadata set that reflects the characteristics of each category in the classification. To this end, the photo archive classification systems and metadata elements of domestic and American university archives were analyzed, and based on this analysis, a model of information structure was proposed. The information structure model presented in this study can help university archives improve the data quality of their photo archives and support users in discovering photo archives more fully.

Safeguarding Korean Export Trade through Social Media-Driven Risk Identification and Characterization

  • Sithipolvanichgul, Juthamon;Abrahams, Alan S.;Goldberg, David M.;Zaman, Nohel;Baghersad, Milad;Nasri, Leila;Ractham, Peter
    • Journal of Korea Trade / v.24 no.8 / pp.39-62 / 2020
  • Purpose - Korean exports account for a vast proportion of Korean GDP, and large volumes of Korean products are sold in the United States. Identifying and characterizing actual and potential product hazards related to Korean products is critical to safeguard Korean export trade, as severe quality issues can impair Korea's reputation and reduce global consumer confidence in Korean products. In this study, we develop country-of-origin-based product risk analysis methods for social media with a specific focus on Korean-labeled products, for the purpose of safeguarding Korean export trade. Design/methodology - We employed two social media datasets containing consumer-generated product reviews. Sentiment analysis is a popular text mining technique used to quantify the type and amount of emotion that is expressed in the text. It is a useful tool for gathering customer opinions regarding products. Findings - We document and discuss the specific potential risks found in Korean-labeled products and explain their implications for safeguarding Korean export trade. Finally, we analyze the false positive matches that arise from the established dictionaries that were used for risk discovery and utilize these classification errors to suggest opportunities for the future refinement of the associated automated text analytic methods. Originality/value - Various studies have used online feedback from social media to analyze product defects. However, none of them links their findings to trade promotion and the protection of a specific country's exports. Therefore, it is important to fill this research gap, which could help to safeguard export trade in Korea.
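Dictionary-based risk discovery and the false positive matches it produces can be illustrated with a small sketch. The term list, the sample reviews, and the function names below are hypothetical and are not the dictionaries or data used in the study; the point is only to show how a context-free term match can flag a harmless review.

```python
import re

# Hypothetical risk dictionary; the study's established dictionaries are not reproduced here.
RISK_TERMS = {"burn", "smoke", "rash", "leak"}

def matched_terms(text, terms=RISK_TERMS):
    """Return the dictionary terms that appear as whole words in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(tokens & terms)

def false_positives(labeled_reviews, terms=RISK_TERMS):
    """Return reviews flagged by the dictionary but labeled as not describing a hazard."""
    return [
        text
        for text, is_hazard in labeled_reviews
        if matched_terms(text, terms) and not is_hazard
    ]

# Toy reviews with manual hazard labels (illustrative only).
reviews = [
    ("The charger started to smoke and burn my desk", True),
    ("I burn through these snacks fast, love them", False),
    ("Nice packaging and quick delivery", False),
]

flagged_in_error = false_positives(reviews)
```

Reviewing the flagged-in-error list is one simple way to find terms whose everyday senses should be excluded or disambiguated in a later refinement of the dictionary.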

A study on the theory of 'Eum-yang-Li-Hap (陰陽離合)' in 6th chapter of 'SoMoon (素問)' 'Yellow Emperior's Nei-ching (黃帝內經)' (황제내경(黃帝內經) 소문(素問) 음양이합론(陰陽離合論)에 대한 고찰(考察))

  • Ok, Do-Hoon;Hong, Won-Sik
    • Journal of Korean Medical classics / v.3 / pp.501-552 / 1989
  • In this thesis, I study the translation and clinical interpretation of the theory of 'Eum-Yang-Li-Hap' and reach the following conclusions. 1. In the title, 'Eum-Yang (陰陽)' means Yin and Yang as a method of understanding nature or the human body, and 'Li-Hap (離合)' means classification and coming together. There is a view that Eum-Yang in the title refers only to the meridians within the human body, but this restriction is unnecessary, because the text itself uses the phrase 'Li-Hap of 3Yin-3Yang (三陰三陽之離合)' to refer to the human meridians. 2. The text is generally divided into three parts. The first part describes the properties of the Li-Hap of Yin and Yang. The second and third parts explain the properties of 3Yin and 3Yang, using the human meridians and their locations as an example, give the names of 3Yin-3Yang, and present the idea of 'Kae-Hap-Choo (開闔樞)'. 3. Many annotators have tried to explain 3Yin-3Yang in the text through three kinds of concepts: the human meridians, the 'Viscera-Bowels (臟腑)', or the 'Element motions and Natural factors (運氣)'. I think these three concepts may have been mixed when the text was written, and for the present I regard 3Yin-3Yang as explained by the concept of the human meridians. 4. I think that 'Eum (陰)', the first character of the names of 3Yin-3Yang, means 'Jok-Gyeong (足經)', in connection with the phrase 'The earth belongs to Yin (地爲陰)' in the text. Further study is needed on the final characters of the names of 3Yin-3Yang. 5. Kae-Hap-Choo, used figuratively as the 'Li (離)' of 3Yin-3Yang, denotes the functions of 3Yin-3Yang as separated by location. In the text, 'Tae-Yang (太陽)' and 'Tae-Eum (太陰)' act as 'Kae (開)'; 'Yang-Myeong (陽明)' and 'Gweor-Eum (厥陰)' act as 'Hap (闔)'; and 'So-Yang (少陽)' and 'So-Eum (少陰)' act as 'Choo (樞)'. However, another theory holds that Gweor-Eum acts as Choo and So-Eum acts as Hap. 6. The theory of Kae-Hap-Choo, which covers only the Jok-Gyeong that are the main material of 'Yook-Gyeong-Byeon-Jeung (六經辨證)', influenced the development of clinical studies. If the theory of Kae-Hap-Choo takes in and integrates the ideas of the '3 burning-Spaces (三焦)', metabolism, and so on, further development of medicine can be expected.

A Spam Mail Classification Using Link Structure Analysis (링크구조분석을 이용한 스팸메일 분류)

  • Rhee, Shin-Young;Khil, A-Ra;Kim, Myung-Won
    • Journal of KIISE:Software and Applications / v.34 no.1 / pp.30-39 / 2007
  • Existing content-based spam mail filtering algorithms have difficulty filtering spam when e-mails contain images but little text. In this thesis, we propose an efficient spam mail classification algorithm that uses the link structure of e-mails. We compute the number of hyperlinks in an e-mail and the in-link frequencies of the web pages hyperlinked in the e-mail. Using these two features, we classify e-mails as spam or legitimate with a decision tree trained for spam mail classification. We also propose a hybrid system that combines three different algorithms by majority voting: the link structure analysis algorithm; a modified link structure analysis algorithm, in which only the host part of the hyperlinked pages of an e-mail is used for link structure analysis; and the content-based method using SVM (support vector machines). The experimental results show that the link structure analysis algorithm slightly outperforms the existing content-based method, with an accuracy of 94.8%. Moreover, the hybrid system achieves an accuracy of 97.0%, a significant improvement over the existing method.
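Two pieces of the approach described above, counting the hyperlinks in an e-mail and combining three classifiers by majority voting, can be sketched as follows. This is only an illustration under stated assumptions: the regular expression is a simplification of real HTML parsing, the function names and sample e-mail are hypothetical, and the in-link frequency feature is omitted because it requires web link data that is not available here.

```python
import re
from collections import Counter

# Simplified pattern for href attributes; a real system would use an HTML parser.
HREF = re.compile(r"""href\s*=\s*["']([^"']+)["']""", re.IGNORECASE)

def count_hyperlinks(html_body):
    """Count the hyperlinks found in an HTML e-mail body."""
    return len(HREF.findall(html_body))

def majority_vote(predictions):
    """Return the label predicted by most classifiers (an odd count avoids ties)."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical e-mail body and hypothetical outputs of the three classifiers.
body = '<a href="http://a.example/x">offer</a> <A HREF=\'http://b.example\'>more</A>'
n_links = count_hyperlinks(body)

votes = ["spam", "ham", "spam"]  # link structure, host-only link structure, SVM
label = majority_vote(votes)
```

With three binary classifiers, at least two always agree, so the vote never ties.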

Spin in Randomised Clinical Trial Reports of Interventions for Obesity (비만 중재 관련 무작위배정 비교임상연구 보고의 spin 연구)

  • Lee, Sle;Won, Jiyoon;Kim, Seoyeon;Park, Su Jeong;Lee, Hyangsook
    • Korean Journal of Acupuncture / v.34 no.4 / pp.251-264 / 2017
  • Objectives: To identify the prevalence and types of spin in randomised controlled trials (RCTs) of obesity with statistically non-significant results for primary outcomes, in order to provide directions for adequate reporting. Methods: Spin is a specific reporting strategy that could lead readers to misinterpret the results of RCTs. RCTs on obesity with statistically non-significant primary outcomes published from July 2015 to June 2016 were retrieved from PubMed. All included RCTs were classified into 3 intervention categories. The identification and classification of spin in the included articles were performed by two independent researchers. Results: Among 46 RCTs with statistically non-significant primary outcomes, 32 studies were assessed as having at least one instance of spin in the title, abstract or main text. Of these, 9 articles were on complementary and alternative medicine, 7 on western medicine and 16 on dietary supplements and exercise. The frequency of spin was similar across the types of interventions. The most common type of spin was 'focusing on statistical significance within-group comparison' in the results sections of the abstract and main text, and 'focusing only on treatment effectiveness with no consideration of statistical significance' in the conclusion sections of the abstract and main text. Studies in which random sequence generation was appropriately done were less likely to have spin. Conclusions: As a majority of obesity RCTs have spin, researchers should pay more attention to adequately interpreting and reporting statistically non-significant results.

A Study on Classification of Wulao(五勞)·Liuji(六極)·Qishang(七傷) (오로(五勞)·육극(六極)·칠상(七傷)의 분류에 관한 고찰)

  • Kim, Jong-hyun
    • Journal of Korean Medical classics / v.32 no.2 / pp.135-146 / 2019
  • Objectives: This study examines the grounds on which Wulao(五勞), Liuji(六極), and Qishang(七傷), which are categories of Xulao(虛勞), are differentiated, along with the standards by which each category is further classified. Methods: Based on "Zhubingyuanhoulun(諸病源候論)", the first text to sort the different types and symptoms of Wulao(五勞), Liuji(六極), and Qishang(七傷), each classification and its symptoms were analyzed. Texts written relatively close in time to "Zhubingyuanhoulun" were referenced in the process. Results & Conclusions: The differentiation of Wulao(五勞), Liuji(六極), and Qishang(七傷) is based on the cause of illness. Wulao(五勞) is caused by mental activity that fatigues the Five Zang, Liuji(六極) is caused by exterior pathogens that damage the Five Body Elements, and Qishang(七傷) is caused by emotional factors as well as damaging practices. On close examination, Wulao(五勞) was further classified according to the different layers of mental activity and described in terms of the taxation illness of the damaged Zang. Liuji(六極) is damage to the Five Body Elements by exterior pathogens, which was fitted into the spatial structure of nature and explained in six categories. Qishang(七傷) refers to the collection of representative symptoms and representative causes of Xulao.

Construction of Retrieval-Based Medical Database

  • Shin Yong-Won;Koo Bong-Oh;Park Byung-Rae
    • Biomedical Science Letters / v.10 no.4 / pp.485-493 / 2004
  • In the current field of medical informatics, information is growing and changing quickly, and the available data range in type from text to images. A small number of technicians digitize these data to build databases, but this takes a great deal of money and time. Therefore, to manage the growing volume of information effectively, the many end users who work directly with the data need to digitize it and build searchable databases themselves. New data and information must be obtained quickly to support quality of care and diagnosis, which are the basic work of medicine. A medical database is also needed for private study and the education of novices, as a tool for turning various data into knowledge. However, current medical databases are used and developed only for hospital work management. In this study, text input, imported files, and object images are digitized to build a database by people who work in the medical field but have no programming expertise. Data are organized hierarchically, and knowledge is then built using a tree-type database construction method. Consequently, users can retrieve data quickly and accurately through search, apply it to study through subject-oriented classification, apply it to diagnosis through a time-dependent view of the data, and apply it to education and precaution through a question-publishing function and the reusability of data.
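The hierarchical, tree-type organization with keyword search described in this abstract can be illustrated with a minimal sketch. The `Node` class, the `search` function, and the sample entries are hypothetical and do not reflect the authors' actual database design; they only show how subject-oriented nesting and search over a tree might fit together.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One entry in a subject hierarchy; leaves typically hold the actual content."""
    name: str
    content: str = ""
    children: list = field(default_factory=list)

    def add(self, child):
        """Attach a child node and return it so deeper levels can be chained."""
        self.children.append(child)
        return child

def search(node, keyword, path=()):
    """Yield the full path of every node whose name or content contains the keyword."""
    here = path + (node.name,)
    if keyword.lower() in f"{node.name} {node.content}".lower():
        yield " / ".join(here)
    for child in node.children:
        yield from search(child, keyword, here)

# Hypothetical subject-oriented hierarchy.
root = Node("Radiology")
chest = root.add(Node("Chest"))
chest.add(Node("Case 1", "pneumonia, right lower lobe"))
root.add(Node("Abdomen", "ultrasound notes"))

hits = list(search(root, "pneumonia"))
```

Returning the full path with each hit keeps the subject classification visible in search results, which is what lets a flat keyword query still be read in its hierarchical context.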

Recognition of Characters Printed on PCB Components Using Deep Neural Networks (심층신경망을 이용한 PCB 부품의 인쇄문자 인식)

  • Cho, Tai-Hoon
    • Journal of the Semiconductor & Display Technology / v.20 no.3 / pp.6-10 / 2021
  • Recognizing characters printed or marked on PCB components from camera images is an important task in PCB component inspection systems. Previous optical character recognition (OCR) for PCB components typically consists of two stages: character segmentation and classification of each segmented character. However, character segmentation often fails because of corrupted characters, low image contrast, and similar problems. OCR without character segmentation is therefore desirable and is increasingly implemented with deep neural networks. A typical segmentation-free implementation based on deep neural networks consists of a convolutional neural network followed by a recurrent neural network (RNN). However, one disadvantage of this approach is slow execution due to the RNN layers. LPRNet is a segmentation-free character recognition network whose excellent accuracy has been demonstrated in license plate recognition. LPRNet uses a wide convolution instead of an RNN, which enables fast inference. In this paper, LPRNet was adapted to recognize characters printed on PCB components with fast execution and high accuracy. Initial training with synthetic images followed by fine-tuning on real text images yielded accurate recognition. The network can be further optimized for Intel CPUs using the OpenVINO toolkit. The optimized version of the network can run in real time, even faster than on a GPU.
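The abstract does not describe how the network output is turned into a character string. The original LPRNet is trained with CTC loss, so assuming the adapted network keeps per-position class scores with a blank class, a common way to read out the text without character segmentation is greedy CTC decoding, sketched below. The class list and the toy scores are hypothetical.

```python
def ctc_greedy_decode(frame_scores, classes, blank=0):
    """Greedy CTC decoding: take the best class per position, collapse
    consecutive repeats, then drop blanks.

    frame_scores: one list of class scores per output position
    classes     : class symbols, with the blank symbol at index `blank`
    """
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    decoded = []
    previous = blank
    for index in best:
        # Emit a character only when the class changes and is not the blank.
        if index != previous and index != blank:
            decoded.append(classes[index])
        previous = index
    return "".join(decoded)

def one_hot(index, size):
    """Build a toy score vector with a single winning class."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Hypothetical class set: index 0 is the CTC blank.
classes = ["-", "A", "B", "1"]
best_path = [1, 1, 0, 1, 2, 2, 0, 3]  # A A - A B B - 1
scores = [one_hot(i, len(classes)) for i in best_path]

text = ctc_greedy_decode(scores, classes)
```

The blank between the second and third positions is what allows the repeated character "A" to be recognized twice instead of being collapsed into one.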