• Title/Summary/Keyword: Extract Keywords

Search Result 126, Processing Time 0.025 seconds

Multi-Vector Document Embedding Using Semantic Decomposition of Complex Documents (복합 문서의 의미적 분해를 통한 다중 벡터 문서 임베딩 방법론)

  • Park, Jongin;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.3
    • /
    • pp.19-41
    • /
    • 2019
  • According to the rapidly increasing demand for text data analysis, research and investment in text mining are being actively conducted not only in academia but also in various industries. Text mining is generally conducted in two steps. In the first step, the text of the collected document is tokenized and structured to convert the original document into a computer-readable form. In the second step, tasks such as document classification, clustering, and topic modeling are conducted according to the purpose of analysis. Until recently, text mining-related studies have been focused on the application of the second steps, such as document classification, clustering, and topic modeling. However, with the discovery that the text structuring process substantially influences the quality of the analysis results, various embedding methods have actively been studied to improve the quality of analysis results by preserving the meaning of words and documents in the process of representing text data as vectors. Unlike structured data, which can be directly applied to a variety of operations and traditional analysis techniques, Unstructured text should be preceded by a structuring task that transforms the original document into a form that the computer can understand before analysis. It is called "Embedding" that arbitrary objects are mapped to a specific dimension space while maintaining algebraic properties for structuring the text data. Recently, attempts have been made to embed not only words but also sentences, paragraphs, and entire documents in various aspects. Particularly, with the demand for analysis of document embedding increases rapidly, many algorithms have been developed to support it. Among them, doc2Vec which extends word2Vec and embeds each document into one vector is most widely used. However, the traditional document embedding method represented by doc2Vec generates a vector for each document using the whole corpus included in the document. This causes a limit that the document vector is affected by not only core words but also miscellaneous words. Additionally, the traditional document embedding schemes usually map each document into a single corresponding vector. Therefore, it is difficult to represent a complex document with multiple subjects into a single vector accurately using the traditional approach. In this paper, we propose a new multi-vector document embedding method to overcome these limitations of the traditional document embedding methods. This study targets documents that explicitly separate body content and keywords. In the case of a document without keywords, this method can be applied after extract keywords through various analysis methods. However, since this is not the core subject of the proposed method, we introduce the process of applying the proposed method to documents that predefine keywords in the text. The proposed method consists of (1) Parsing, (2) Word Embedding, (3) Keyword Vector Extraction, (4) Keyword Clustering, and (5) Multiple-Vector Generation. The specific process is as follows. all text in a document is tokenized and each token is represented as a vector having N-dimensional real value through word embedding. After that, to overcome the limitations of the traditional document embedding method that is affected by not only the core word but also the miscellaneous words, vectors corresponding to the keywords of each document are extracted and make up sets of keyword vector for each document. Next, clustering is conducted on a set of keywords for each document to identify multiple subjects included in the document. Finally, a Multi-vector is generated from vectors of keywords constituting each cluster. The experiments for 3.147 academic papers revealed that the single vector-based traditional approach cannot properly map complex documents because of interference among subjects in each vector. With the proposed multi-vector based method, we ascertained that complex documents can be vectorized more accurately by eliminating the interference among subjects.

Systematic Reviews of Current Domestic Studies of Herbaceous Plants on Anti-diabetes - since 2000 (국내 천연물 항 당뇨 실험연구의 체계적 논문 고찰 - 2000년 ~ 2010년)

  • Choi, You-Kyung
    • Journal of Physiology & Pathology in Korean Medicine
    • /
    • v.25 no.3
    • /
    • pp.389-397
    • /
    • 2011
  • This study tried to integrate the traditional oriental medical theories and results of experimental studies of herbaceous plants on anti-diabetes. And I tried to analyze recent experimental study trend on the anti-diabetic herb. I searched anti-diabetic herb studies on 4 korean databases and 10 korean journals by keywords, 'diabetes', 'blood glucose', 'glycometabolism', 'pancreatic ${\beta}$-cell', etc. In order to see detail review, searching was performed from 2000 to 2010. And I searched 125 study cases concerning anti-diabetic herb and 72 varieties herbaceous plants used in study of anti-diabetes. and I analyzed the choice motives of each herb for anti-diabetic study, the extract methods and anti-diabetic evaluation contents. And I analyzed anti-diabetic herbs from a traditional oriental medical point of view. When the researchers chose herb for anti-diabetic experiment, just 8.8% of the choice was based on the oriental medical evidences. I found that 60.6% of the herb shown to be effective in diabetes experimentally had oriental medical theory-based Properties(性). There were studies with whole plants(16.8%), aqueous extract(45.6%), methanol extract(8.0%), ethanol extract(8.0%) and comparative studies of more than 2 types of extracts or various fractions(18.4%). The most frequent experimental diabetic models was diabetic mouse induced by streptozotocin(STZ)(87.8%). And there were db/db mouse(6.7%), ob/ob mouse(1.1%), etc. 33.6% of all studies just measured hematological indices of diabetes, and 66.4% researches analyzed details. To improve herbaceous plants study on diabetes, we oriental medical scientists have to integrate the oriental medical theories and results of experimental studies.

R&D Perspective Social Issue Packaging using Text Analysis

  • Wong, William Xiu Shun;Kim, Namgyu
    • Journal of Information Technology Services
    • /
    • v.15 no.3
    • /
    • pp.71-95
    • /
    • 2016
  • In recent years, text mining has been used to extract meaningful insights from the large volume of unstructured text data sets of various domains. As one of the most representative text mining applications, topic modeling has been widely used to extract main topics in the form of a set of keywords extracted from a large collection of documents. In general, topic modeling is performed according to the weighted frequency of words in a document corpus. However, general topic modeling cannot discover the relation between documents if the documents share only a few terms, although the documents are in fact strongly related from a particular perspective. For instance, a document about "sexual offense" and another document about "silver industry for aged persons" might not be classified into the same topic because they may not share many key terms. However, these two documents can be strongly related from the R&D perspective because some technologies, such as "RF Tag," "CCTV," and "Heart Rate Sensor," are core components of both "sexual offense" and "silver industry." Thus, in this study, we attempted to discover the differences between the results of general topic modeling and R&D perspective topic modeling. Furthermore, we package social issues from the R&D perspective and present a prototype system, which provides a package of news articles for each R&D issue. Finally, we analyze the quality of R&D perspective topic modeling and provide the results of inter- and intra-topic analysis.

Semantic Ontology Speech Recognition Performance Improvement using ERB Filter (ERB 필터를 이용한 시맨틱 온톨로지 음성 인식 성능 향상)

  • Lee, Jong-Sub
    • Journal of Digital Convergence
    • /
    • v.12 no.10
    • /
    • pp.265-270
    • /
    • 2014
  • Existing speech recognition algorithm have a problem with not distinguish the order of vocabulary, and the voice detection is not the accurate of noise in accordance with recognized environmental changes, and retrieval system, mismatches to user's request are problems because of the various meanings of keywords. In this article, we proposed to event based semantic ontology inference model, and proposed system have a model to extract the speech recognition feature extract using ERB filter. The proposed model was used to evaluate the performance of the train station, train noise. Noise environment of the SNR-10dB, -5dB in the signal was performed to remove the noise. Distortion measure results confirmed the improved performance of 2.17dB, 1.31dB.

Analysis of Baby Moisturizers Marketing in Korea (국내 시판중인 소아용 피부 보습 제품의 분석 및 고찰)

  • Kim, Yoon-Young;Seo, Young-Min;Kim, Jang-Hyun
    • The Journal of Pediatrics of Korean Medicine
    • /
    • v.22 no.3
    • /
    • pp.63-73
    • /
    • 2008
  • Objectives : The purpose of this study is to analyze the baby moisturizers and give information about basis of herbal moisturizer. Methods : We selected 243 goods by searching with keywords as "baby moisturizer" at 6 major web search engines, 12 web shopping malls in Korea. 10 items were evaluated under three evaluation criteria such as type of product, ingredient and function of goods. Results : Study resulted that the most type of moisturizers were lotion type. 80% of the products contained medical agents. Ingredients of medical agents were herbal medicine, aroma oil, vitamin and extract. The moisturizer for atopic dermatitis contained ceramide about half. The ingredient of herbal medicine existed usually as excipients, not as active ingredient. Conclusion : It is necessary to study and develop new products contained herbal medicine as active ingredient based on the oriental medical theory.

  • PDF

Keyword identifications on dimensions for service quality of Healthcare providers (헬스케어 서비스 리뷰를 활용한 서비스 품질 차원 별 중요 단어 파악 방안)

  • Lee, Hong Joo
    • Knowledge Management Research
    • /
    • v.19 no.4
    • /
    • pp.171-185
    • /
    • 2018
  • Studies on online review have carried out analysis of the rating and topic as a whole. However, it is necessary to analyze opinions on various dimensions of service quality. This study classifies reviews of healthcare services into service quality dimensions, and proposes a method to identify words that are mainly referred to in each dimension. Service quality was based on the dimensions provided by SERVQUAL, and patient reviews have collected from NHSChoice. The 2,000 sentences sampled were classified into service quality dimension of SERVQUAL and a method of extracting important keywords from sentences by service quality dimension was suggested. The RAKE algorithm is used to extract key words from a single document and an index is considered to consider frequently used words in various documents. Since we need to identify key words in various reviews, we have considered frequency and discrimination (IDF) at the same time, rather than identifying key words based only on the RAKE score. In SERVQUAL dimension, we identified the words that patients mentioned mainly, and also identified the words that patients mainly refer to by review rating.

Toward Sentiment Analysis Based on Deep Learning with Keyword Detection in a Financial Report (재무 보고서의 키워드 검출 기반 딥러닝 감성분석 기법)

  • Jo, Dongsik;Kim, Daewhan;Shin, Yoojin
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.24 no.5
    • /
    • pp.670-673
    • /
    • 2020
  • Recent advances in artificial intelligence have allowed for easier sentiment analysis (e.g. positive or negative forecast) of documents such as a finance reports. In this paper, we investigate a method to apply text mining techniques to extract in the financial report using deep learning, and propose an accounting model for the effects of sentiment values in financial information. For sentiment analysis with keyword detection in the financial report, we suggest the input layer with extracted keywords, hidden layers by learned weights, and the output layer in terms of sentiment scores. Our approaches can help more effective strategy for potential investors as a professional guideline using sentiment values.

Analysis of Secondary Battery Trends Using Topic Modeling: Focusing on Solid-State Batteries

  • Chunghyun Do;Yong Jin Kim
    • Asian Journal of Innovation and Policy
    • /
    • v.12 no.3
    • /
    • pp.345-362
    • /
    • 2023
  • As the widespread adoption and proliferation of electric vehicles continue, the secondary battery market is experiencing rapid growth. However, lithium-ion batteries, which constitute a majority of secondary batteries, present high risks of fire and explosion. Solid-state batteries are thus garnering attention as the next-generation batteries since they eliminate fire hazards and significantly reduce the risk of explosions. Against this background, the study aimed to analyze research trends and provide insights by examining 2,927 domestic papers related to solid-state batteries over the past decade (2013-2022). Specifically, we used topic modeling to extract major keywords associated with solid-state batteries research and to explore the network characteristics across major topics. The changes in research on solid-state batteries were analyzed in-depth by calculating topic dominance by year. The findings provide an overview of the emerging trends in domestic solid-state battery research, and might serve as a valuable reference in shaping long-term research directions.

Investigating Dynamic Mutation Process of Issues Using Unstructured Text Analysis (비정형 텍스트 분석을 활용한 이슈의 동적 변이과정 고찰)

  • Lim, Myungsu;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.1
    • /
    • pp.1-18
    • /
    • 2016
  • Owing to the extensive use of Web media and the development of the IT industry, a large amount of data has been generated, shared, and stored. Nowadays, various types of unstructured data such as image, sound, video, and text are distributed through Web media. Therefore, many attempts have been made in recent years to discover new value through an analysis of these unstructured data. Among these types of unstructured data, text is recognized as the most representative method for users to express and share their opinions on the Web. In this sense, demand for obtaining new insights through text analysis is steadily increasing. Accordingly, text mining is increasingly being used for different purposes in various fields. In particular, issue tracking is being widely studied not only in the academic world but also in industries because it can be used to extract various issues from text such as news, (SocialNetworkServices) to analyze the trends of these issues. Conventionally, issue tracking is used to identify major issues sustained over a long period of time through topic modeling and to analyze the detailed distribution of documents involved in each issue. However, because conventional issue tracking assumes that the content composing each issue does not change throughout the entire tracking period, it cannot represent the dynamic mutation process of detailed issues that can be created, merged, divided, and deleted between these periods. Moreover, because only keywords that appear consistently throughout the entire period can be derived as issue keywords, concrete issue keywords such as "nuclear test" and "separated families" may be concealed by more general issue keywords such as "North Korea" in an analysis over a long period of time. This implies that many meaningful but short-lived issues cannot be discovered by conventional issue tracking. Note that detailed keywords are preferable to general keywords because the former can be clues for providing actionable strategies. To overcome these limitations, we performed an independent analysis on the documents of each detailed period. We generated an issue flow diagram based on the similarity of each issue between two consecutive periods. The issue transition pattern among categories was analyzed by using the category information of each document. In this study, we then applied the proposed methodology to a real case of 53,739 news articles. We derived an issue flow diagram from the articles. We then proposed the following useful application scenarios for the issue flow diagram presented in the experiment section. First, we can identify an issue that actively appears during a certain period and promptly disappears in the next period. Second, the preceding and following issues of a particular issue can be easily discovered from the issue flow diagram. This implies that our methodology can be used to discover the association between inter-period issues. Finally, an interesting pattern of one-way and two-way transitions was discovered by analyzing the transition patterns of issues through category analysis. Thus, we discovered that a pair of mutually similar categories induces two-way transitions. In contrast, one-way transitions can be recognized as an indicator that issues in a certain category tend to be influenced by other issues in another category. For practical application of the proposed methodology, high-quality word and stop word dictionaries need to be constructed. In addition, not only the number of documents but also additional meta-information such as the read counts, written time, and comments of documents should be analyzed. A rigorous performance evaluation or validation of the proposed methodology should be performed in future works.

Methodology for Issue-related R&D Keywords Packaging Using Text Mining (텍스트 마이닝 기반의 이슈 관련 R&D 키워드 패키징 방법론)

  • Hyun, Yoonjin;Shun, William Wong Xiu;Kim, Namgyu
    • Journal of Internet Computing and Services
    • /
    • v.16 no.2
    • /
    • pp.57-66
    • /
    • 2015
  • Considerable research efforts are being directed towards analyzing unstructured data such as text files and log files using commercial and noncommercial analytical tools. In particular, researchers are trying to extract meaningful knowledge through text mining in not only business but also many other areas such as politics, economics, and cultural studies. For instance, several studies have examined national pending issues by analyzing large volumes of text on various social issues. However, it is difficult to provide successful information services that can identify R&D documents on specific national pending issues. While users may specify certain keywords relating to national pending issues, they usually fail to retrieve appropriate R&D information primarily due to discrepancies between these terms and the corresponding terms actually used in the R&D documents. Thus, we need an intermediate logic to overcome these discrepancies, also to identify and package appropriate R&D information on specific national pending issues. To address this requirement, three methodologies are proposed in this study-a hybrid methodology for extracting and integrating keywords pertaining to national pending issues, a methodology for packaging R&D information that corresponds to national pending issues, and a methodology for constructing an associative issue network based on relevant R&D information. Data analysis techniques such as text mining, social network analysis, and association rules mining are utilized for establishing these methodologies. As the experiment result, the keyword enhancement rate by the proposed integration methodology reveals to be about 42.8%. For the second objective, three key analyses were conducted and a number of association rules between national pending issue keywords and R&D keywords were derived. The experiment regarding to the third objective, which is issue clustering based on R&D keywords is still in progress and expected to give tangible results in the future.