• Title/Summary/Keyword: Text Mining for Korean

Agriculture Big Data Analysis System Based on Korean Market Information

  • Chuluunsaikhan, Tserenpurev;Song, Jin-Hyun;Yoo, Kwan-Hee;Rah, Hyung-Chul;Nasridinov, Aziz
    • Journal of Multimedia Information System / v.6 no.4 / pp.217-224 / 2019
  • As the world's population grows, maintaining the food supply is becoming a bigger problem. Now and in the future, big data will play a major role in decision making in the agriculture industry. The challenge is how to obtain valuable information that helps us make future decisions. Big data helps us see history more clearly, uncover hidden value, and make the right decisions for the government and farmers. To help address this challenge, we developed the Agriculture Big Data Analysis System. The system consists of agricultural big data collection, big data analysis, and big data visualization. First, we collected structured data such as price, climate, and yield, and unstructured data such as news, blogs, and TV programs. Using the collected data, we implemented prediction algorithms such as ARIMA, Decision Tree, LDA, and LSTM and present the results as data visualizations.

Customized recommendation system through product review analysis (상품 리뷰 분석을 통한 사용자 맞춤형 추천 시스템)

  • Hwang, Doyeun;Bae, Sangjung;Kim, Changsoo;Jung, Heokyung
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2018.05a / pp.460-461 / 2018
  • Traditional recommendation systems are developed on the assumption that users behave independently, and they suffer from poor readability and efficiency because they simply sort products or lack a function for associating product attributes with the user's tastes. To solve this problem, this study proposes a system that provides user-customized information by crawling product review data, processing it into meaningful information using text mining with R, and analyzing the unstructured review data together with users' purchase histories. This helps users make decisions by giving them only the necessary information, without their having to analyze massive amounts of product review data.

Development of Classification Model for Healthcare Contents on the Online Community (온라인 커뮤니티에서의 건강 관련 콘텐츠 분류 모형 개발)

  • Kim, Tae-Yun;Kim, Yoo-Sin;Choi, Sang-Hyun;Kim, Do-Hun;Chang, You-Jin
    • The Journal of Information Systems / v.26 no.4 / pp.285-301 / 2017
  • Purpose: In this paper, we verified the reliability of healthcare-related information provided by various users on Naver Jisikin, a representative Korean search platform. Based on Q&A content, we validated the reliability of answers to questions about lung cancer with the help of professors at a medical school. Design/methodology/approach: In the content analysis, questions are classified into types such as symptom/diagnosis, therapy, prognosis, after-management, and so on. The answers contain advice, advertisement, oriental medicine, and religion as well as the above five question categories. The validation of medical evidence for each answer shows that only 49% of all answers have medical grounds. Findings: We classified the medically grounded answers into three levels: high, medium, and low. Among all answers, those containing advertisements need to be identified because they can be harmful to patients. Using text mining, we devised a method to select the answers containing advertisement content. The selection model performs well, with 84% classification accuracy.

Identifying literature-based significant genes and discovering novel drug indications on PPI network

  • Park, Minseok;Jang, Giup;Lee, Taekeon;Yoon, Youngmi
    • Journal of the Korea Society of Computer and Information / v.22 no.3 / pp.131-138 / 2017
  • New drug development is time-consuming and costly. Hence, it is necessary to repurpose old drugs to find new indications. We propose a method for repurposing old drugs using massive literature data and a biological network. We assume that a disease-drug relationship is plausible if the signal pathways of the relationship include significant genes identified in the literature data. This research consists of three steps: identifying significant genes using co-occurrence in the literature; analyzing the shortest paths on the biological network; and scoring a relationship by comparing the significant genes with the shortest paths. We identify significant genes based on the co-occurrence frequency between a gene and a disease in the literature. On a network whose edge weights represent the likelihood of interaction between genes, we use the shortest paths as signal pathways. We compare the genes identified as significant with the genes on the signal pathways, calculate the scores, and then identify the candidate drugs. Through this process, we present drugs with new potential for repurposing and demonstrate the use of our approach as a new method of drug repurposing.
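The three steps in this abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the toy abstracts, the gene and node names (`GENEA`, `DRUG_TARGET`, `DISEASE_GENE`), the whitespace tokenization, the significance threshold of two co-occurrences, the conversion of interaction probabilities into path costs via `-log(p)`, and the scoring rule (share of intermediate path nodes that are significant genes).

```python
import heapq
import math
from collections import Counter


def cooccurrence_counts(abstracts, disease, genes):
    """Count the abstracts that mention both the disease and each gene."""
    counts = Counter()
    for text in abstracts:
        tokens = set(text.lower().split())
        if disease.lower() in tokens:
            for gene in genes:
                if gene.lower() in tokens:
                    counts[gene] += 1
    return counts


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; each edge costs -log(probability), with 0 < p <= 1."""
    best = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, prob in graph.get(node, {}).items():
            new_cost = cost - math.log(prob)
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                prev[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    if target not in best:
        return []  # target unreachable
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def path_score(path, significant_genes):
    """Share of intermediate path nodes that are significant genes (assumed rule)."""
    inner = path[1:-1]
    if not inner:
        return 0.0
    return sum(gene in significant_genes for gene in inner) / len(inner)


# Hypothetical toy data.
abstracts = [
    "GENEA is upregulated in asthma patients",
    "asthma severity correlates with GENEA and GENEB",
    "GENEC regulates bone growth",
]
counts = cooccurrence_counts(abstracts, "asthma", ["GENEA", "GENEB", "GENEC"])
significant = {gene for gene, n in counts.items() if n >= 2}  # assumed threshold

graph = {
    "DRUG_TARGET": {"GENEA": 0.9, "GENEC": 0.5},
    "GENEA": {"DISEASE_GENE": 0.8},
    "GENEC": {"DISEASE_GENE": 0.4},
}
path = shortest_path(graph, "DRUG_TARGET", "DISEASE_GENE")
print(path, path_score(path, significant))
```

Taking `-log(p)` as the edge cost makes the lowest-cost path the one with the highest product of interaction probabilities, which is one common way to turn a likelihood-weighted network into a shortest-path problem.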

Monetary policy synchronization of Korea and United States reflected in the statements (통화정책 결정문에 나타난 한미 통화정책 동조화 현상 분석)

  • Chang, Youngjae
    • The Korean Journal of Applied Statistics / v.34 no.1 / pp.115-126 / 2021
  • Central banks communicate with the market through a statement on the direction of monetary policy while implementing monetary policy. The rapid contraction of the global economy due to the recent Covid-19 pandemic could be compared to the crisis situation during the 2008 global financial crisis. In this paper, we analyzed the text data from the monetary policy statements of the Bank of Korea and Fed reflecting monetary policy directions focusing on how they were affected in the face of a global crisis. For analysis, we collected the text data of the two countries' monetary policy direction reports published from October 1999 to September 2020. We examined the semantic features using word cloud and word embedding, and analyzed the trend of the similarity between two countries' documents through a piecewise regression tree model. The visualization result shows that both the Bank of Korea and the US Fed have published the statements with refined words of clear meaning for transparent and effective communication with the market. The analysis of the dissimilarity trend of documents in both countries also shows that there exists a sense of synchronization between them as the rapid changes in the global economic environment affect monetary policy.

Analysis of Twitter for 2012 South Korea Presidential Election by Text Mining Techniques (텍스트 마이닝을 이용한 2012년 한국대선 관련 트위터 분석)

  • Bae, Jung-Hwan;Son, Ji-Eun;Song, Min
    • Journal of Intelligence and Information Systems / v.19 no.3 / pp.141-156 / 2013
  • Social media is a representative form of Web 2.0 that shapes users' information behavior by allowing them to produce their own content without any expert skills. In particular, as a new communication medium, it has a profound impact on social change by enabling users to share their opinions and thoughts with the masses and with acquaintances. Social media data plays a significant role in the emerging Big Data arena. A variety of research areas, such as social network analysis and opinion mining, have therefore paid attention to discovering meaningful information from the vast amounts of data buried in social media. Social media has recently become a main focus of the fields of Information Retrieval and Text Mining because it not only produces massive unstructured textual data in real time but also serves as an influential channel for opinion leading. However, most previous studies have adopted broad-brush and limited approaches, which have made it difficult to find and analyze new information. To overcome these limitations, we developed a real-time Twitter trend mining system that captures trends by processing big stream datasets of Twitter in real time. The system offers term co-occurrence retrieval, visualization of Twitter users by query, similarity calculation between two users, topic modeling to keep track of changes in topical trends, and mention-based user network analysis. In addition, we conducted a case study on the 2012 Korean presidential election. We collected 1,737,969 tweets that mention the candidates' names and the election on Twitter in Korea (http://www.twitter.com/) for one month in 2012 (October 1 to October 31). The case study shows that the system provides useful information and detects societal trends effectively. The system also retrieves the list of terms that co-occur with given query terms.
We compare the results of term co-occurrence retrieval using the names of influential candidates, 'Geun Hae Park', 'Jae In Moon', and 'Chul Su Ahn', as query terms. General terms related to the presidential election, such as 'Presidential Election', 'Proclamation in Support', and 'Public opinion poll', appear frequently. The results also show specific terms that differentiate each candidate, such as 'Park Jung Hee' and 'Yuk Young Su' for the query 'Geun Hae Park', 'a single candidacy agreement' and 'Time of voting extension' for the query 'Jae In Moon', and 'a single candidacy agreement' and 'down contract' for the query 'Chul Su Ahn'. Our system not only extracts 10 topics along with related terms but also shows the topics' dynamic changes over time by employing the multinomial Latent Dirichlet Allocation technique. Each topic shows one of two patterns, a rising tendency or a falling tendency, depending on the change in its probability distribution. To determine the relationship between topic trends on Twitter and social issues in the real world, we compare topic trends with related news articles. We find that Twitter can track an issue faster than other media, namely newspapers. The user network on Twitter differs from those of other social media because of the distinctive way relationships are formed on Twitter: users can build relationships by exchanging mentions. We visualize and analyze mention-based networks of 136,754 users, using the three candidates' names, 'Geun Hae Park', 'Jae In Moon', and 'Chul Su Ahn', as query terms. The results show that Twitter users mention all candidates' names regardless of their political tendencies. This case study shows that Twitter could be an effective tool to detect and predict dynamic changes in social issues, and that mention-based user networks, which are unique to Twitter, could reveal different aspects of user behavior.
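Term co-occurrence retrieval, as described in this abstract, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tweets and the placeholder query `candidate_a` are hypothetical, and whitespace tokenization stands in for the Korean morphological analysis that a real system would need.

```python
from collections import Counter


def cooccurring_terms(tweets, query, top_n=3):
    """Return the terms that most often appear in the same tweet as `query`."""
    counts = Counter()
    for tweet in tweets:
        terms = set(tweet.lower().split())  # count each term once per tweet
        if query in terms:
            counts.update(terms - {query})
    return counts.most_common(top_n)


# Hypothetical toy tweets.
tweets = [
    "candidate_a leads new poll",
    "poll shows candidate_a ahead",
    "candidate_b announces rally",
]
print(cooccurring_terms(tweets, "candidate_a", top_n=1))
```

Running the same function once per candidate name and comparing the resulting term lists mirrors the comparison of general and candidate-specific terms reported in the case study.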

Recent Domestic Research Trend Over Startups: Focusing on the Social Network Analysis of Research Variables (스타트업 관련 최근 국내 연구 동향: 연구 변수들에 대한 소셜 네트워크 분석을 중심으로)

  • Kil, ChangMin;Yang, DongWoo
    • Asia-Pacific Journal of Business Venturing and Entrepreneurship / v.17 no.2 / pp.81-97 / 2022
  • The purpose of this paper is to grasp recent research trends by analyzing the variables used in startup-related papers. The startup-related papers considered here are registered papers published from 2013 to 2020 that include 'startups' in the title. The analysis methods are text mining of all variables and text-network analysis of the affected variables. Gephi is used as the visualization tool for network analysis. The results of the variable analysis are as follows. First, independent variables consist mainly of variables about startups' internal factors and the outside environment, but because of the characteristics of startups, such as those of early-stage and innovative companies, most variables concern enterprise internal competitiveness, marketing 4P strategy, entrepreneurship, cooperation method, transformational leadership, enterprise features, lean startup strategy, enterprise internal communication, value orientation, task conflict, relationship conflict, knowledge sharing, etc. Second, dependent variables are mainly about outcomes and are broadly classified into financial performance and non-financial performance. Because startups are too immature to achieve good financial performance, startup-related papers show greater interest in non-financial performance, such as management performance, team performance, and SCM performance, alongside financial performance such as sales quantity. This study shows the following. Although there are not many officially registered papers dealing with startups, those papers cover various startup themes. For example, there are trendy themes such as lean startup strategy, crowdfunding, influencers, and accelerators.

KNU Korean Sentiment Lexicon: Bi-LSTM-based Method for Building a Korean Sentiment Lexicon (Bi-LSTM 기반의 한국어 감성사전 구축 방안)

  • Park, Sang-Min;Na, Chul-Won;Choi, Min-Seong;Lee, Da-Hee;On, Byung-Won
    • Journal of Intelligence and Information Systems / v.24 no.4 / pp.219-240 / 2018
  • Sentiment analysis, one of the text mining techniques, is a method for extracting subjective content embedded in text documents. Recently, sentiment analysis methods have been widely used in many fields. For example, data-driven surveys are based on analyzing the subjectivity of text data posted by users, and market research is conducted by analyzing users' review posts to quantify a target product's reputation among users. The basic method of sentiment analysis is to use a sentiment dictionary (or lexicon), a list of sentiment vocabularies with positive, neutral, or negative semantics. In general, the meaning of many sentiment words is likely to differ across domains. For example, the sentiment word 'sad' indicates a negative meaning in many fields but not necessarily for a movie. To perform accurate sentiment analysis, we need to build a sentiment dictionary for a given domain. However, building such a sentiment lexicon is time-consuming, and many sentiment vocabularies are missed unless a general-purpose sentiment lexicon is used. To address this problem, several studies have constructed sentiment lexicons suitable for a specific domain based on 'OPEN HANGUL' and 'SentiWordNet', which are general-purpose sentiment lexicons. However, OPEN HANGUL is no longer being serviced, and SentiWordNet does not work well because of language differences in the process of converting Korean words into English words. These restrictions limit the use of such general-purpose sentiment lexicons as seed data for building a sentiment lexicon for a specific domain. In this article, we construct the 'KNU Korean Sentiment Lexicon (KNU-KSL)', a new general-purpose Korean sentiment dictionary that is more advanced than existing general-purpose lexicons.
The proposed dictionary, a list of domain-independent sentiment words such as 'thank you', 'worthy', and 'impressed', is built so that a sentiment dictionary for a target domain can be constructed quickly. Specifically, it constructs sentiment vocabularies by analyzing the glosses contained in the Standard Korean Language Dictionary (SKLD) using the following procedure. First, we propose a sentiment classification model based on Bidirectional Long Short-Term Memory (Bi-LSTM). Second, the proposed deep learning model automatically classifies each gloss as having either a positive or a negative meaning. Third, positive words and phrases are extracted from the glosses classified as positive, while negative words and phrases are extracted from the glosses classified as negative. Our experimental results show that the average accuracy of the proposed sentiment classification model is up to 89.45%. In addition, the sentiment dictionary is further extended using various external sources, including SentiWordNet, SenticNet, Emotional Verbs, and Sentiment Lexicon 0603. Furthermore, we add sentiment information about frequently used coined words and emoticons that are used mainly on the Web. The KNU-KSL contains a total of 14,843 sentiment vocabularies, each of which is a 1-gram, a 2-gram, a phrase, or a sentence pattern. Unlike existing sentiment dictionaries, it is composed of words that are not affected by particular domains. The recent trend in sentiment analysis is to use deep learning techniques without sentiment dictionaries, and the importance of developing sentiment dictionaries has gradually declined. However, a recent study shows that the words in a sentiment dictionary can be used as features of deep learning models, resulting in sentiment analysis with higher accuracy (Teng, Z., 2016). This result indicates that a sentiment dictionary can be used not only for sentiment analysis but also as features of deep learning models for improving accuracy.
The proposed dictionary can be used as basic data for constructing the sentiment lexicon of a particular domain and as features of deep learning models. It is also useful for automatically and quickly building large training sets for deep learning models.
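The dictionary-based approach that this abstract calls the basic method of sentiment analysis can be sketched as follows. This is a hedged illustration only: `TOY_LEXICON` is a hypothetical four-word English lexicon with assumed scores of +1 and -1, not an excerpt of KNU-KSL, and the sketch handles single whitespace-separated tokens rather than the 2-grams, phrases, and sentence patterns the real lexicon contains.

```python
def polarity_score(text, lexicon):
    """Sum the lexicon scores of the tokens in `text`; unknown tokens count as 0."""
    tokens = text.lower().split()
    return sum(lexicon.get(token, 0) for token in tokens)


def polarity_label(score):
    """Map a numeric score to a positive, negative, or neutral label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical toy lexicon (assumed scores).
TOY_LEXICON = {"worthy": 1, "impressed": 1, "sad": -1, "boring": -1}

review = "I was impressed and it felt worthy"
print(polarity_label(polarity_score(review, TOY_LEXICON)))
```

A domain-specific lexicon would replace or re-weight entries such as 'sad', whose polarity the abstract notes can differ for movies.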

Selection of Effective Herbal Medicines for Parkinson's Disease Based on the Text Mining of the Classical Korean Medical Literature Donguibogam

  • Bae, Hyo Won;Lee, Tae Wook;Choi, Byung Tae;Shin, Hwa Kyoung;Yun, Young Ju
    • The Journal of Korean Medicine / v.42 no.4 / pp.120-132 / 2021
  • Objectives: The prevalence of Parkinson's disease is on an upward trend along with an increase in the aging population, but there is no available treatment that halts the progression of neurodegeneration. This study reports a numerical analysis of Donguibogam and suggests novel herbal drugs that have not been researched before but were deemed effective in this study. Methods: Referring to 71 Korean medicine symptom terms that represent the symptoms of Parkinson's disease, 4170 prescriptions described in Donguibogam were classified into two groups based on whether their main effects were effective for Parkinson's disease or not. Comparing the two groups, the chi-square test was performed to select statistically significant herbs, while the t-test, Wilcoxon test, and descriptive statistics were used to determine the appropriate dose. Results: One hundred and twenty-seven prescriptions effective for Parkinson's disease were identified. The chi-square test determined 17 herbs that are effective for symptomatic treatment. Among the medicinal herbs, the authors suggest Osterici seu Notopterygii Radix et Rhizoma, Ephedrae Herba, Aconiti Tuber, Myrrha, Sinomeni Caulis et Rhizoma, and Aconiti Kusnezoffii Tuber as herbal candidates that have never been studied for Parkinson's disease. Through the statistical tests, it was judged that the mean value of the dose of the entire prescription was the appropriate dose for each herb. Conclusions: Seventeen herbs were selected for Parkinson's disease and the appropriate daily doses were calculated. Furthermore, this study presents a new process that applies statistical methods to traditional medical literature to preselect herbs deemed effective for specific diseases.
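The herb-selection step can be illustrated with a 2x2 chi-square test per herb. The sketch below computes the Pearson chi-square statistic without continuity correction, which is an assumption because the abstract does not state which variant was used. The group sizes (127 effective prescriptions out of 4170, leaving 4043 others) come from the abstract; the herb counts (30 and 200) are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Rows: effective prescriptions vs. all others; columns: herb present vs. absent.
effective_total, other_total = 127, 4170 - 127
herb_in_effective, herb_in_other = 30, 200  # hypothetical counts

statistic = chi_square_2x2(
    herb_in_effective, effective_total - herb_in_effective,
    herb_in_other, other_total - herb_in_other,
)

CRITICAL_VALUE = 3.841  # chi-square critical value for 1 degree of freedom at alpha = 0.05
print(statistic > CRITICAL_VALUE)
```

With one degree of freedom, a statistic above 3.841 indicates that the herb's frequency differs between the two groups at the 5% level. Testing many herbs this way would normally also call for a multiple-comparison adjustment, which the abstract does not discuss.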

Perception and Appraisal of Urban Park Users Using Text Mining of Google Maps Review - Cases of Seoul Forest, Boramae Park, Olympic Park - (구글맵리뷰 텍스트마이닝을 활용한 공원 이용자의 인식 및 평가 - 서울숲, 보라매공원, 올림픽공원을 대상으로 -)

  • Lee, Ju-Kyung;Son, Yong-Hoon
    • Journal of the Korean Institute of Landscape Architecture / v.49 no.4 / pp.15-29 / 2021
  • The study aims to grasp the perception and appraisal of urban park users through text analysis. This study used Google review data provided by Google Maps. Google Maps Review is an online review platform that provides information evaluating locations through social media and provides an understanding of locations from the perspective of general reviewers and regional guides who are registered as members of Google Maps. The study determined if the Google Maps Reviews were useful for extracting meaningful information about the user perceptions and appraisals for parks management plans. The study chose three urban parks in Seoul, South Korea; Seoul Forest, Boramae Park, and Olympic Park. Review data for each of these three parks were collected via web crawling using Python. Through text analysis, the keywords and network structure characteristics for each park were analyzed. The text was analyzed, as were park ratings, and the analysis compared the reviews of residents and foreign tourists. The common keywords found in the review comments for the three parks were "walking", "bicycle", "rest" and "picnic" for activities, "family", "child" and "dogs" for accompanying types, and "playground" and "walking trail" for park facilities. Looking at the characteristics of each park, Seoul Forest shows many outdoor activities based on nature, while the lack of parking spaces and congestion on weekends negatively impacted users. Boramae Park has the appearance of a city park, with various facilities providing numerous activities, but reviewers often cited the park's complexity and the negative aspects in terms of dog walking groups. At Olympic Park, large-scale complex facilities and cultural events were frequently mentioned, emphasizing its entertainment functions. Google Maps Review can function as useful data to identify parks' overall users' experiences and general feelings. 
Compared to data from other social media sites, Google Maps Review's data provides ratings and understanding factors, including user satisfaction and dissatisfaction.