• Title/Summary/Keyword: LDA Topic Model

Search results: 103

A study on the classification of research topics based on COVID-19 academic research using Topic modeling (토픽모델링을 활용한 COVID-19 학술 연구 기반 연구 주제 분류에 관한 연구)

  • Yoo, So-yeon; Lim, Gyoo-gun
    • Journal of Intelligence and Information Systems / v.28 no.1 / pp.155-174 / 2022
  • From January 2020 to October 2021, more than 500,000 academic studies related to COVID-19 (the disease caused by severe acute respiratory syndrome coronavirus 2) were published. This rapid growth places time and technical constraints on healthcare professionals and policy makers who need to find important research quickly. In this study, we therefore propose a method for extracting useful information from the text of an extensive body of literature using the LDA and Word2vec algorithms. Papers matching the keywords of interest were extracted from the COVID-19 literature, and their detailed topics were identified. The data came from the CORD-19 data set on Kaggle, a free academic resource prepared by major research groups and the White House in response to the COVID-19 pandemic and updated weekly. The research method consists of two main parts. First, 41,062 articles were obtained by filtering and pre-processing the abstracts of 47,110 academic papers that included full text. For this purpose, the number of COVID-19 publications per year was analyzed through exploratory data analysis in Python, and the top 10 journals with the most active research were identified. The LDA and Word2vec algorithms were used to derive COVID-19 research topics, and similarity was measured after analyzing related words. Second, papers containing 'vaccine' and 'treatment' were extracted from the topics derived from all papers, yielding 4,555 papers related to 'vaccine' and 5,971 papers related to 'treatment'. For each set of collected papers, detailed topics were analyzed using the LDA and Word2vec algorithms, and clustering after PCA dimension reduction was applied so that groups of papers with similar themes could be visualized with the t-SNE algorithm. A noteworthy result is that topics that did not appear among those derived from all COVID-19 papers were found in the topic modeling results for the individual research keywords. For example, topic modeling of the papers related to 'vaccine' extracted a new topic, Topic 05 'neutralizing antibodies'. A neutralizing antibody is an antibody that protects cells from infection when a virus enters the body, and it is said to play an important role in the production of therapeutic agents and in vaccine development. Likewise, extracting topics from the papers related to 'treatment' revealed a new topic, Topic 05 'cytokine'. A cytokine storm occurs when the body's immune cells attack normal cells instead of defending against the attack. Hidden topics that could not be found in the corpus as a whole were uncovered by classifying papers by keyword and performing topic modeling on each group to find detailed topics. In this study, we proposed a method of extracting topics from a large body of literature using the LDA algorithm and extracting similar words using Skip-gram, the Word2vec model that predicts surrounding words from a central word. The LDA and Word2vec models were combined with the aim of improving performance by capturing both the relationship between documents and LDA topics and the relationships learned by Word2vec. In addition, we presented a way to classify documents intuitively: clustering after PCA dimension reduction, with the t-SNE technique used to group documents with similar themes into a structured organization. In a situation where researchers' efforts to overcome COVID-19 cannot keep pace with the rapid publication of related academic papers, we hope this approach will save healthcare professionals and policy makers precious time and effort and help them gain new insights rapidly. It is also expected to serve as basic data for researchers exploring new research directions.
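The two-stage idea in this abstract (topic modeling on the whole corpus, then again on a keyword-filtered subset) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a small collapsed Gibbs sampler for LDA written in plain Python, and the toy corpus, the function names `lda_gibbs` and `papers_with_keyword`, and the hyperparameters `alpha` and `beta` are all assumptions made for the example. The Word2vec, PCA, and t-SNE steps are not covered.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return (doc_topic, topic_word, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic_counts = [[0] * n_topics for _ in docs]
    topic_word_counts = [[0] * n_words for _ in range(n_topics)]
    topic_totals = [0] * n_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            k = rng.randrange(n_topics)
            doc_assignments.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][word_id[w]] += 1
            topic_totals[k] += 1
        assignments.append(doc_assignments)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][v] -= 1
                topic_totals[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][v] + beta)
                    / (topic_totals[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][v] += 1
                topic_totals[k] += 1

    # Smoothed estimates of the document-topic and topic-word distributions.
    doc_topic = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic_counts, docs)
    ]
    topic_word = [
        [(c + beta) / (topic_totals[k] + n_words * beta) for c in topic_word_counts[k]]
        for k in range(n_topics)
    ]
    return doc_topic, topic_word, vocab


def papers_with_keyword(docs, keyword):
    """Keep only the tokenized documents that contain the keyword."""
    return [doc for doc in docs if keyword in doc]


# Hypothetical, already-tokenized abstracts (not taken from CORD-19).
corpus = [
    ["vaccine", "antibody", "neutralizing", "trial"],
    ["vaccine", "dose", "trial", "antibody"],
    ["treatment", "cytokine", "storm", "patient"],
    ["treatment", "drug", "patient", "cytokine"],
    ["vaccine", "treatment", "patient", "trial"],
]

# Stage 1: topics over all papers.
doc_topic, topic_word, vocab = lda_gibbs(corpus, n_topics=2)

# Stage 2: topics over the 'vaccine' subset only.
vaccine_docs = papers_with_keyword(corpus, "vaccine")
sub_doc_topic, sub_topic_word, sub_vocab = lda_gibbs(vaccine_docs, n_topics=2)
```

Refitting on the subset lets the topic budget be spent entirely on vocabulary that matters within 'vaccine' papers, which is one plausible reason finer topics can surface there that do not appear in the corpus-wide model.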

  • A Study on the Research Topics and Trends in South Korea: Focusing on Particulate Matter (토픽모델링을 이용한 국내 미세먼지 연구 분류 및 연구동향 분석)

    • Park, Hyemin; Kim, Taeyong; Kwon, Daewoong; Heo, Junyong; Lee, Juyeon; Yang, Minjune
      • Korean Journal of Remote Sensing / v.38 no.5_3 / pp.873-885 / 2022
    • Particulate matter (PM) has emerged as a hot topic around the world since it was reported to be related to increases in mortality and prevalence rates. In South Korea, the importance of PM has been recognized since the late 1990s, and various studies on PM have been conducted. This study investigated PM research topics and trends in papers (D=2,764) published in the Research Information Sharing Service (RISS) using topic modeling based on Latent Dirichlet Allocation (LDA). A total of 10 topics were identified across all papers, and the PM research topics were classified as 'PM reduction (Topic 1)', 'Government policy and management (Topic 2)', 'Characteristics of PM (Topic 3)', 'PM model (Topic 4)', 'Environmental education (Topic 5)', 'Bio (Topic 6)', 'Traffic (Topic 7)', 'Asian dust (Topic 8)', 'Indoor PM (Topic 9)', and 'Human risk (Topic 10)'. In particular, the proportion of papers on 'Government policy and management (Topic 2)', 'PM model (Topic 4)', 'Environmental education (Topic 5)', and 'Bio (Topic 6)' relative to the total number of papers increased over time (linear slope > 0). The results of this study provide a new literature review methodology for particulate matter research, along with its history and insights.
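The trend criterion mentioned above (a topic's share of papers rising over time, i.e. linear slope > 0) can be illustrated with an ordinary least-squares fit. This is a sketch under stated assumptions: each paper is assumed to be assigned to a single dominant topic, the abstract does not say how the proportions were actually computed, and the records below are invented for the example. The names `topic_share_by_year` and `linear_slope` are hypothetical.

```python
def topic_share_by_year(records, topic):
    """records: iterable of (year, dominant_topic). Return (years, shares)."""
    totals, hits = {}, {}
    for year, dominant_topic in records:
        totals[year] = totals.get(year, 0) + 1
        if dominant_topic == topic:
            hits[year] = hits.get(year, 0) + 1
    years = sorted(totals)
    shares = [hits.get(year, 0) / totals[year] for year in years]
    return years, shares


def linear_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical data: four papers per year, with Topic 2 growing from 1 to 3 papers.
records = (
    [(2019, 2)] + [(2019, 1)] * 3
    + [(2020, 2)] * 2 + [(2020, 1)] * 2
    + [(2021, 2)] * 3 + [(2021, 1)]
)

years, shares = topic_share_by_year(records, topic=2)
slope = linear_slope(years, shares)
increasing = slope > 0
```

With these invented records the yearly shares are 0.25, 0.50, and 0.75, so the fitted slope is 0.25 per year and the topic would count as increasing.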

    Analyzing Students' Non-face-to-face Course Evaluation by Topic Modeling and Developing Deep Learning-based Classification Model (토픽 모델링 기반 비대면 강의평 분석 및 딥러닝 분류 모델 개발)

    • Han, Ji Yeong; Heo, Go Eun
      • Journal of the Korean Society for Library and Information Science / v.55 no.4 / pp.267-291 / 2021
    • Due to the global COVID-19 pandemic in 2020, educational settings have changed significantly. Universities have fully introduced remote learning, which had been considered a supplementary form of education, non-face-to-face classes have become commonplace, and professors and students are making great efforts to adapt to the new educational environment. To improve the quality of non-face-to-face lectures amid these changes, it is necessary to study the factors affecting lecture satisfaction. Therefore, this paper presents a new methodology that uses big data to identify how the factors affecting university lecture satisfaction changed before and after COVID-19. We use topic modeling to analyze lecture reviews from before and after COVID-19 and identify the factors affecting lecture satisfaction. On this basis, we suggest a direction for university education. In addition, by building a topic classification model based on KoBERT, a deep learning language model, with an F1-score of 0.84, we can identify the factors of lecture satisfaction and dissatisfaction from multiple angles and further contribute to the continuous qualitative improvement of lecture satisfaction.
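For readers unfamiliar with the metric, the F1-score reported for the topic classifier is the harmonic mean of precision and recall. The sketch below computes per-class F1 and their unweighted (macro) average; the abstract does not state which averaging scheme was used, so macro averaging is an assumption, and the labels and the function name `macro_f1` are invented for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical topic labels for four lecture reviews.
y_true = ["content", "content", "delivery", "delivery"]
y_pred = ["content", "delivery", "delivery", "delivery"]
score = macro_f1(y_true, y_pred)
```

Here 'content' has precision 1 and recall 0.5 (F1 = 2/3), 'delivery' has precision 2/3 and recall 1 (F1 = 0.8), so the macro average is 11/15, about 0.733.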

    Identifying Topic-Specific Experts on Microblog

    • Yu, Yan; Mo, Lingfei; Wang, Jian
      • KSII Transactions on Internet and Information Systems (TIIS) / v.10 no.6 / pp.2627-2647 / 2016
    • With the rapid growth of microblogs, expert identification on microblogs has come to play a crucial role in many applications. Most previous expert identification studies assess only the global authoritativeness of a user and offer no way to differentiate authoritativeness on particular topics. In this paper, we propose a novel model that jointly models text and following relationships in the same generative process. Furthermore, we integrate a similarity-based weighting scheme into the model to address the popularity bias problem, and use the followee topic distribution as prior information to make the user's topic distribution more precise. Our empirical study on two large real-world datasets shows that the proposed model produces significantly higher-quality results than the prior art.

    A Study of Research on Methods of Automated Biomedical Document Classification using Topic Modeling and Deep Learning (토픽모델링과 딥 러닝을 활용한 생의학 문헌 자동 분류 기법 연구)

    • Yuk, JeeHee; Song, Min
      • Journal of the Korean Society for Information Management / v.35 no.2 / pp.63-88 / 2018
    • This research evaluated how classification performance differs across feature selection methods (the LDA topic model and Doc2Vec, which is based on word embedding using deep learning), feature corpus sizes, and classification algorithms. To find the feature corpus with the highest classification performance, experiments were conducted with feature corpora composed differently according to location within the document and with feature corpora of different sizes. The deep learning experiments additionally evaluated training frequency and specifically considered information for context inference. This study constructed a biomedical document dataset, Disease-35083, consisting of biomedical scholarly documents provided by PMC and categorized by disease category. The study verifies which type and size of feature corpus produces the highest performance and also suggests feature corpora that are extensible to specific features, as shown by their efficiency during training. Additionally, it compares deep learning with existing methods and suggests an appropriate method for each classification environment.

    Systemic Analysis of Research Activities and Trends Related to Artificial Intelligence(A.I.) Technology Based on Latent Dirichlet Allocation (LDA) Model (Latent Dirichlet Allocation (LDA) 모델 기반의 인공지능(A.I.) 기술 관련 연구 활동 및 동향 분석)

    • Chung, Myoung Sug; Lee, Joo Yeoun
      • Journal of Korea Society of Industrial Information Systems / v.23 no.3 / pp.87-95 / 2018
    • With the recent technological development of artificial intelligence, the related market is expanding rapidly. In the artificial intelligence field, which is still in its early stage but continues to expand, it is important to reduce uncertainty about research directions and investment areas. This study therefore examined technology trends using text mining and topic modeling, two big data analysis methods, and identified trends in core technologies and their future growth potential. We hope that the results of this study will give researchers an understanding of artificial intelligence technology trends and new implications for future research directions.

    Examining Suicide Tendency Social Media Texts by Deep Learning and Topic Modeling Techniques (딥러닝 및 토픽모델링 기법을 활용한 소셜 미디어의 자살 경향 문헌 판별 및 분석)

    • Ko, Young Soo; Lee, Ju Hee; Song, Min
      • Journal of the Korean BIBLIA Society for Library and Information Science / v.32 no.3 / pp.247-264 / 2021
    • This study aims to build a deep learning-based model that classifies suicide tendency, using a suicide corpus constructed for the study. To analyze suicide factors, the study also divided the suicide-tendency corpus into detailed topics using topic modeling, an analysis technique that automatically extracts topics. For this purpose, 2,011 suicide-related documents collected from the social media service Naver Knowledge iN were manually annotated as suicide-tendency or non-suicide-tendency documents based on the suicide prevention education manual issued by the Central Suicide Prevention Center, and the performance of deep learning classification models (LSTM, BERT, ELECTRA) was evaluated on the annotated corpus. In addition, LDA, one of the topic modeling techniques, identified suicide factors by classifying the literature by theme, and co-word analysis and visualization were conducted to analyze the factors in depth.

    Comparison of policy perceptions between national R&D projects and standing committees using topic modeling analysis : focusing on the ICT field (토픽모델링 분석을 활용한 국가연구개발사업과제와 국회 상임위원회 사이의 정책 인식 비교 : ICT 분야를 중심으로)

    • Song, Byoungki; Kim, Sangung
      • Journal of Industrial Convergence / v.20 no.7 / pp.1-11 / 2022
    • In this paper, numerical values are derived using topic modeling, one of the data-based evaluation methodologies discussed by various research institutes. We focus on the ICT field to see whether policy perceptions differ between national R&D projects and the standing committees. First, we create a model for classifying ICT documents by training a HAN model on R&D project data. We then perform LDA topic modeling on the ICT documents classified by this model and compare the distribution of topics derived from the R&D project data with that derived from the proceedings of the standing committees. In total, 26 topics were derived. The R&D project data contained specialized topics, whereas the standing committees discussed relatively social and popular issues. Because the difference in perception can be confirmed numerically, this work can serve as a basic study on indicators for future policy or project evaluation.

    Product Evaluation Criteria Extraction through Online Review Analysis: Using LDA and k-Nearest Neighbor Approach (온라인 리뷰 분석을 통한 상품 평가 기준 추출: LDA 및 k-최근접 이웃 접근법을 활용하여)

    • Lee, Ji Hyeon; Jung, Sang Hyung; Kim, Jun Ho; Min, Eun Joo; Yeo, Un Yeong; Kim, Jong Woo
      • Journal of Intelligence and Information Systems / v.26 no.1 / pp.97-117 / 2020
    • Product evaluation criteria are indicators that describe the attributes or values of products and enable users or manufacturers to measure and understand them. When companies analyze their products or compare them with competitors' products, appropriate criteria must be selected for objective evaluation. The criteria should reflect the product features that consumers considered when they purchased, used, and evaluated the products. However, current evaluation criteria do not reflect how consumer opinions differ from product to product. Previous studies tried to use online reviews from e-commerce sites, which reflect consumer opinions, to extract product features and topics and use them as evaluation criteria. However, these approaches still produce criteria irrelevant to the products because extracted or improper words are not refined. To overcome this limitation, this research proposes the LDA-k-NN model, which extracts candidate criteria words from online reviews using LDA and refines them with the k-nearest neighbor approach. The proposed approach starts with a preparation phase consisting of six steps. First, it collects review data from e-commerce websites. Most e-commerce websites classify their items into high-level, middle-level, and low-level categories. Review data for the preparation phase are gathered from each middle-level category and later merged to represent a single high-level category. Next, nouns, adjectives, adverbs, and verbs are extracted from the reviews using part-of-speech information from a morpheme analysis module. After preprocessing, LDA produces the words for each topic in the reviews, and only the nouns among the topic words are chosen as potential criteria words. Then, the words are tagged according to whether they can serve as criteria for each middle-level category. Next, every tagged word is vectorized with a pre-trained word embedding model. Finally, a k-nearest neighbor case-based approach is used to classify each word with its tag.
After the preparation phase is set up, the criteria extraction phase is conducted on low-level categories. This phase starts by crawling reviews in the corresponding low-level category. The same preprocessing as in the preparation phase is conducted using the morpheme analysis module and LDA. Candidate criteria words are extracted by taking nouns from the data and are vectorized with the pre-trained word embedding model. Finally, evaluation criteria are extracted by refining the candidate words using the k-nearest neighbor approach and the reference proportion of each word in the word set. To evaluate the performance of the proposed model, an experiment was conducted with reviews from '11st', one of the biggest e-commerce companies in Korea. The review data came from the 'Electronics/Digital' section, one of the high-level categories on 11st. The suggested model was compared with three others: the actual criteria of 11st, a model that extracts nouns with the morpheme analysis module and refines them by word frequency, and a model that extracts nouns from LDA topics and refines them by word frequency. The evaluation task was to predict the evaluation criteria of 10 low-level categories with the suggested model and the three models above. The criteria words extracted by each model were combined into a single word set, which was used for survey questionnaires. In the survey, respondents chose every item they considered an appropriate criterion for each category. Each model scored when a chosen word had been extracted by that model. The suggested model had higher scores than the other models in 8 out of 10 low-level categories. Paired t-tests on the scores of each model confirmed that the suggested model performed better in 26 tests out of 30. In addition, the suggested model was the best model in terms of accuracy.
This research proposes an evaluation criteria extraction method that combines topic extraction using LDA with refinement by the k-nearest neighbor approach. This method overcomes the limits of previous dictionary-based models and frequency-based refinement models. This study can contribute to improving review analysis for deriving business insights in the e-commerce market.
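The refinement step described in this abstract (classifying candidate words by the tags of their nearest tagged neighbors in embedding space) might look roughly like the sketch below. It is only an illustration: the two-dimensional vectors stand in for a pre-trained word embedding, the tags 'criterion' and 'other', the words, the choice of cosine similarity, and the function names are all assumptions, and the paper's additional 'reference proportion' filter is not implemented.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_label(vector, tagged, k=3):
    """Majority tag among the k tagged vectors most similar to `vector`."""
    neighbors = sorted(
        tagged, key=lambda item: cosine(vector, item[0]), reverse=True
    )[:k]
    votes = Counter(tag for _, tag in neighbors)
    return votes.most_common(1)[0][0]


# Preparation phase (hypothetical): tagged words with toy embedding vectors.
tagged_words = {
    "battery": ([0.90, 0.10], "criterion"),
    "screen": ([0.80, 0.20], "criterion"),
    "price": ([0.85, 0.30], "criterion"),
    "yesterday": ([0.10, 0.90], "other"),
    "friend": ([0.20, 0.80], "other"),
    "box": ([0.30, 0.90], "other"),
}
tagged = list(tagged_words.values())

# Extraction phase (hypothetical): candidate nouns taken from LDA topic words.
candidates = {"weight": [0.70, 0.20], "gift": [0.15, 0.85]}

refined = [
    word for word, vector in candidates.items()
    if knn_label(vector, tagged, k=3) == "criterion"
]
```

With these toy vectors, 'weight' sits next to the tagged criterion words and is kept, while 'gift' falls among the non-criterion words and is dropped. Using an odd k with two tags avoids tied votes.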

    A Study on the Research Topics and Trends in Korean Journal of Remote Sensing: Focusing on Natural & Environmental Disasters (토픽모델링을 이용한 대한원격탐사학회지의 연구주제 분류 및 연구동향 분석: 자연·환경재해 분야를 중심으로)

    • Kim, Taeyong; Park, Hyemin; Heo, Junyong; Yang, Minjune
      • Korean Journal of Remote Sensing / v.37 no.6_2 / pp.1869-1880 / 2021
    • The Korean Journal of Remote Sensing (KJRS), which has led the field of remote sensing and GIS in South Korea for over 37 years, has published interdisciplinary research papers. In this study, we performed topic modeling based on Latent Dirichlet Allocation (LDA), a probabilistic generative model, to identify research topics and trends in 1) all articles and 2) the articles related to natural and environmental disasters published in KJRS, by analyzing titles, keywords, and abstracts. The LDA results showed that 4 topics ('Polar', 'Hydrosphere', 'Geosphere', and 'Atmosphere') were identified across all articles and that the 'Polar' topic was dominant among them over time (linear slope = 3.51 × 10⁻³, p < 0.05). For the articles related to natural and environmental disasters, the optimal number of topics was 7 ('Marine pollution', 'Air pollution', 'Volcano', 'Wildfire', 'Flood', 'Drought', and 'Heavy rain'), and the 'Air pollution' topic was dominant over time (linear slope = 2.61 × 10⁻³, p < 0.05). The results of this study provide multidisciplinary researchers with the history of, and insight into, natural and environmental disaster research in KJRS.

