• Title/Summary/Keyword: information classification


Development of a water quality prediction model for mineral springs in the metropolitan area using machine learning (머신러닝을 활용한 수도권 약수터 수질 예측 모델 개발)

  • Yeong-Woo Lim;Ji-Yeon Eom;Kee-Young Kwahk
    • Journal of Intelligence and Information Systems / v.29 no.1 / pp.307-325 / 2023
  • The prolonged COVID-19 pandemic has sharply increased the number of people visiting nearby mountains and national parks to relieve the depression and lethargy of extended indoor living. Mineral springs, found not only in mountains and national parks but also in neighborhood parks and along trails, are where thousands of these visitors stop to rest and drink; there are about 600 of them in the metropolitan area. However, because water quality tests are irregular and performed manually, people drink the water without knowing the test results in real time. In this study, we therefore develop a model that can predict spring water quality in real time by identifying the factors affecting water quality and collecting data scattered across various sources. Owing to data-collection limitations, we restricted the scope to Seoul and Gyeonggi-do and obtained water quality test records from 2015 to 2020 for about 300 mineral springs in 18 cities where data management is well maintained. Among the many factors considered to affect the suitability of mineral spring water quality, 10 were finally selected after two rounds of review. Using AutoML, an automated machine learning technology that has recently attracted attention, we derived the top five of roughly 20 machine learning methods by predictive performance. Among them, the CatBoost model performed best, with a classification accuracy of 75.26%. In addition, examining the absolute influence of the analysis variables on the prediction with the SHAP method showed that the most important factor was whether the spring had been judged nonconforming in its previous water quality test; the temperature on the inspection day and the altitude of the spring also influenced whether the water quality was unsuitable.
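A minimal sketch of the CatBoost-plus-SHAP step described above, assuming tabular data in a pandas DataFrame; the CSV path and the three column names are hypothetical stand-ins for the study's ten factors, not the authors' actual variables:

```python
# Hypothetical sketch: column names and CSV path are illustrative only.
import pandas as pd
import shap
from catboost import CatBoostClassifier

df = pd.read_csv("spring_water_tests.csv")  # assumed layout
X = df[["prev_test_failed", "temp_on_test_day", "altitude_m"]]  # 3 of 10 factors
y = df["unsuitable"]  # 1 = judged nonconforming

model = CatBoostClassifier(iterations=300, verbose=False).fit(X, y)

# SHAP quantifies each feature's contribution to the prediction, which is
# how the study ranked prior-test failure as the most important factor.
explainer = shap.TreeExplainer(model)
shap.summary_plot(explainer.shap_values(X), X)
```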

Digital Archives of Cultural Archetype Contents: Its Problems and Direction (디지털 아카이브즈의 문제점과 방향 - 문화원형 콘텐츠를 중심으로 -)

  • Hahm, Han-Hee;Park, Soon-Cheol
    • Journal of the Korean Biblia Society for Library and Information Science / v.17 no.2 / pp.23-42 / 2006
  • This is a study of the digital archives of Culturecontent.com, where 'Cultural Archetype Contents' are currently in service. A major purpose of our study is to point out problems in the current system and propose improvements to the digital archives. The government launched a four-year project to develop cultural archetype content sources and establish the related business, hoping to enhance the nation's competitiveness. More specifically, the project focuses on producing source materials for cultural archetype contents on the subjects of Korea's history, tradition, everyday life, arts, and general geographical books. Through this project, the government also intends to establish a proper distribution system for digitalized culture contents and to manage copyright issues. This paper analyzes the digital archives system that stores the culture content data produced from 2002 to 2005 and evaluates the current system's weaknesses and strengths. Our findings are summarized as follows. First, the digital archives system does not contain a semantic search engine, so its functionality lags. Second, similar data are classified into different categories rather than the same ones, confusing and inconveniencing users; users looking for source materials may well be disappointed by the current distributive system. Our paper suggests a better digital archives system built on text mining technology comprising five intelligent processes: keyword search, summarization, clustering, classification, and topic tracking. We endeavor to develop the best technical environment for preserving and using culture contents data. With the upgraded digital environment, users of culture contents data will discover a world of new knowledge, and the technology introduced in this paper can lead to the highest achievable digital intelligence through a new framework.
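As an illustration of one of the five proposed processes (clustering), a minimal sketch that groups similar archive records with TF-IDF and k-means; the record texts and cluster count are invented for demonstration:

```python
# Toy clustering of archive record descriptions; the texts are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

records = [
    "Court music scores of the Joseon dynasty",
    "Traditional court music of the Joseon period",
    "Geographical atlas of eighteenth-century Korea",
    "Historical maps and geography of Korean provinces",
]
vectors = TfidfVectorizer().fit_transform(records)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # records on the same subject should share a cluster label
```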

Predicting the Potential Habitat and Future Distribution of Brachydiplax chalybea flavovittata Ris, 1911 (Odonata: Libellulidae) (기후변화에 따른 남색이마잠자리 잠재적 서식지 및 미래 분포예측)

  • Soon Jik Kwon;Yung Chul Jun;Hyeok Yeong Kwon;In Chul Hwang;Chang Su Lee;Tae Geun Kim
    • Journal of Wetlands Research / v.25 no.4 / pp.335-344 / 2023
  • Brachydiplax chalybea flavovittata, a climate-sensitive biological indicator species, was first observed and recorded on Jeju Island, Korea, in 2010, and its overwintering was recently confirmed in the Yeongsan River area. This study aimed to predict the potential distribution patterns of the larvae of B. chalybea flavovittata and to understand its ecological characteristics and population changes under global climate change. Data were collected from the Global Biodiversity Information Facility (GBIF) and through field surveys from May 2019 to May 2023. Bioclimatic variables downloaded from the WorldClim database (19 in total) were used for the distribution model, and the MaxEnt model was adopted to predict the potential and future distribution of B. chalybea flavovittata. Larval distribution ranged in latitude from Jeju-si, Jeju Special Self-Governing Province (33.318096°N) to Yeoju-si, Gyeonggi-do (37.366734°N) and in longitude from Jindo-gun, Jeollanam-do (126.054925°E) to Yangsan-si, Gyeongsangnam-do (129.016472°E). Under the Ramsar wetland classification system, M-type wetlands (permanent rivers, streams, and creeks) were the most common habitat, followed by Tp-type (permanent freshwater marshes and pools; 45.8%) and F-type (estuarine waters; 4.2%). The MaxEnt model indicated that the potential distribution with high inhabiting probability includes Ulsan and Daegu Metropolitan City in addition to the currently known habitats. Under the future scenarios of the Intergovernmental Panel on Climate Change (IPCC), the possible distribution area was predicted to expand in the 2050s and 2090s, covering the southern and western coastal regions, the southern Daegu metropolitan area, and the eastern coastal regions. This study suggests that B. chalybea flavovittata can serve as an effective indicator species for climate change if its distribution range is monitored. Our findings will also provide basic information for the conservation and management of co-existing native species.
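MaxEnt itself is a maximum-entropy density estimator usually run through dedicated software; as a rough stand-in, a presence/background classifier over two hypothetical bioclimatic predictors conveys the idea (the arrays below are synthetic, and the study used all 19 WorldClim variables):

```python
# Simplified presence/background analogue of a MaxEnt-style habitat model.
# bio1 (annual mean temperature) and bio12 (annual precipitation) are the
# two assumed predictors; all values here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
presence = rng.normal([14.0, 1300.0], [1.0, 100.0], (50, 2))     # occurrence points
background = rng.normal([10.0, 1000.0], [3.0, 300.0], (500, 2))  # random background

X = np.vstack([presence, background])
y = np.r_[np.ones(len(presence)), np.zeros(len(background))]

model = LogisticRegression(max_iter=1000).fit(X, y)
# Relative habitat suitability for one new grid cell:
print(model.predict_proba([[13.5, 1250.0]])[0, 1])
```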

Performance Evaluation of Monitoring System for Sargassum horneri Using GOCI-II: Focusing on the Results of Removing False Detection in the Yellow Sea and East China Sea (GOCI-II 기반 괭생이모자반 모니터링 시스템 성능 평가: 황해 및 동중국해 해역 오탐지 제거 결과를 중심으로)

  • Han-bit Lee;Ju-Eun Kim;Moon-Seon Kim;Dong-Su Kim;Seung-Hwan Min;Tae-Ho Kim
    • Korean Journal of Remote Sensing / v.39 no.6_2 / pp.1615-1633 / 2023
  • Sargassum horneri is a floating alga that breeds in large quantities in the Yellow Sea and East China Sea and then drifts to the coast of the Republic of Korea, causing problems such as environmental damage and harm to fish farms. To prevent damage effectively and preserve the coastal environment, detection algorithms for Sargassum horneri using satellite-based remote sensing have been actively developed. However, incorrect detections increase the sailing distance of the ships that collect Sargassum horneri and cause confusion in the responses of local governments and institutions, so minimizing false detections is essential when producing Sargassum horneri spatial information. This study applied technology that automatically removes false detections from the results of the GOCI-II-based Sargassum horneri detection algorithm of the National Ocean Satellite Center (NOSC) of the Korea Hydrographic and Oceanographic Agency (KHOA). Based on an analysis of the causes of the major false detections, the procedure removes linear and sporadic false detections and treats the green algae that bloom in large quantities along the coast of China in spring and summer as false detections. The automatic removal technology was applied to the dates on which Sargassum horneri occurred between February 24 and June 25, 2022. Visual assessment results were generated using mid-resolution satellite images, and qualitative and quantitative evaluations were performed. Linear false detections were completely removed, and most of the sporadic and green-algae false detections that affected the distribution were removed. Even after the automatic false-detection removal, the distribution area of Sargassum horneri could be confirmed against the visual assessment results, and the accuracy and precision calculated with a binary classification model averaged 97.73% and 95.4%, respectively. The recall value was very low at 29.03%, presumably because of Sargassum horneri movement due to the observation-time discrepancy between GOCI-II and the mid-resolution satellite images, differences in spatial resolution, location deviation from orthocorrection, and cloud masking. The false-detection removal results can determine the spatial distribution status in near real time, but they are limited in accurately estimating biomass. Continuous research on upgrading the Sargassum horneri monitoring system is therefore needed so that it can inform future response plans.
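The accuracy, precision, and recall figures above come from a binary (detection vs. no detection) comparison against visual assessment; a minimal sketch with made-up confusion counts shows how recall can collapse while accuracy and precision stay high:

```python
# Placeholder confusion counts, not the paper's numbers.
def evaluate(tp: int, fp: int, fn: int, tn: int) -> dict:
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),  # few false alarms -> high precision
        "recall": tp / (tp + fn),     # missed patches drag recall down
    }

# Many missed pixels (fn) barely move accuracy when tn dominates:
print(evaluate(tp=90, fp=4, fn=220, tn=9000))
```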

A Study on Kiosk Satisfaction Level Improvement: Focusing on Kano, Timko, and PCSI Methodology (키오스크 소비자의 만족수준 연구: Kano, Timko, PCSI 방법론을 중심으로)

  • Choi, Jaehoon;Kim, Pansoo
    • Asia-Pacific Journal of Business Venturing and Entrepreneurship / v.17 no.4 / pp.193-204 / 2022
  • This study measured customer satisfaction with kiosks and analyzed the impact of potential improvements, targeting kiosk users. With advancing technology and an improved online environment, the probability that simple labor tasks will disappear within 10 years is estimated at close to 90%, and domestic research likewise predicts that 'simple labor jobs' will disappear under the influence of advanced technology with a probability of about 36%. In particular, as demand for non-face-to-face services has grown with the global spread of COVID-19, the introduction of kiosks has accelerated, and the global market was expected to reach 83.5 billion won in 2021, with an average annual growth rate of 8.9%. However, because kiosks are unmanned, some consumers still have difficulty using them. Consumers unfamiliar with the technology hold negative attitudes toward service co-production, rejecting non-face-to-face service and fearing service errors; this lack of understanding leads to role conflicts between sales clerks and consumers and creates inequality in service provision relative to generations accustomed to the technology. Moreover, since the kiosk is a representative technology-based self-service industry, if users feel uncomfortable or additional labor is required, the overall service value decreases and the growth of the kiosk industry itself can be suppressed, so addressing these issues is important. Accordingly, interviews about the main points of direct use were conducted with actual users, from which ten evaluation items were extracted: display color scheme, text size, device design, device size, internal UI (interface), amount of information, recognition sensor (barcode, NFC, etc.), display brightness, self-event, and reaction speed. A questionnaire was then used to classify the Kano-model quality attributes of each item, Timko's customer satisfaction coefficient was calculated to express the results in precise numerical terms, and a PCSI Index analysis was additionally performed to classify the improvement impact of each item and determine improvement priorities. As a result, the impact of improvement appears in the order of internal UI (interface), text size, recognition sensor (barcode, NFC, etc.), reaction speed, self-event, display brightness, amount of information, device size, device design, and display color scheme. Through this, we intend to contribute to a comprehensive comparison of kiosk-based research in each field and to set the direction for improvement in the venture industry.
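A minimal sketch of Timko's customer satisfaction coefficients computed from Kano survey tallies; the category counts are invented, and questionable/reverse responses are ignored for brevity:

```python
# Timko's Better/Worse coefficients from Kano category counts (A, O, M, I).
def timko(attractive: int, one_dim: int, must_be: int, indifferent: int):
    total = attractive + one_dim + must_be + indifferent
    better = (attractive + one_dim) / total   # satisfaction gain if fulfilled
    worse = -(one_dim + must_be) / total      # dissatisfaction if unfulfilled
    return better, worse

# Invented tallies for one kiosk item, e.g. 'reaction speed':
print(timko(attractive=40, one_dim=30, must_be=20, indifferent=10))
# -> (0.7, -0.5): fulfilling this item raises satisfaction strongly
```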

Clickstream Big Data Mining for Demographics based Digital Marketing (인구통계특성 기반 디지털 마케팅을 위한 클릭스트림 빅데이터 마이닝)

  • Park, Jiae;Cho, Yoonho
    • Journal of Intelligence and Information Systems / v.22 no.3 / pp.143-163 / 2016
  • The demographics of Internet users are the most basic and important source for target marketing and personalized advertisements on digital marketing channels, including email, mobile, and social media. However, it has gradually become difficult to collect user demographics because online activities are often anonymous. Although marketing departments can obtain demographics through online or offline surveys, these approaches are expensive, slow, and prone to false statements. Clickstream data is the record an Internet user leaves behind while visiting websites: as the user clicks anywhere on a webpage, the activity is logged in semi-structured website log files. Such data shows which pages users visited, how long they stayed, how often and when they visited, which sites they prefer, which keywords they used to find a site, whether they purchased anything, and so forth. For this reason, some researchers have tried to infer the demographics of Internet users from their clickstream data, deriving various independent variables likely to correlate with demographics, including search keywords; frequency and intensity by time, day, and month; variety of websites visited; and text from visited web pages. The demographic attributes predicted also vary by study, covering gender, age, job, location, income, education, marital status, and presence of children. A variety of data mining methods, such as LSA, SVM, decision trees, neural networks, logistic regression, and k-nearest neighbors, have been used for prediction model building. However, prior research has not yet identified which data mining method is appropriate for each demographic variable, and the independent variables studied so far need to be reviewed, combined as needed, and evaluated to build the best prediction model. The objective of this study is to choose the clickstream attributes most likely to correlate with demographics from previous research and then to identify which data mining method best predicts each demographic attribute. Among demographic attributes, this paper focuses on predicting gender, age, marital status, residence, and job, applying 64 clickstream attributes drawn from previous research. The overall process of predictive model building consists of four steps. In the first step, we create user profiles containing the 64 clickstream attributes and 5 demographic attributes. The second step reduces the dimensionality of the clickstream variables to address the curse of dimensionality and overfitting, using three approaches based on decision trees, PCA, and cluster analysis. In the third step, we build alternative predictive models for each demographic variable using SVM, neural networks, and logistic regression. The last step evaluates the alternative models by accuracy and selects the best one. For the experiments, we used clickstream data covering 5 demographic attributes and 16,962,705 online activities for 5,000 Internet users. IBM SPSS Modeler 17.0 was used for the prediction process, and 5-fold cross-validation was conducted to enhance the reliability of the experiments. The experimental results verify that a specific data mining method is well suited to each demographic variable: for example, age prediction performs best with decision-tree-based dimension reduction and a neural network, whereas gender and marital status are predicted most accurately by an SVM without dimension reduction. We conclude that the online behaviors of Internet users, captured through clickstream data analysis, can be used to predict their demographics and thereby support digital marketing.
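A minimal sketch of one modeling alternative from the four-step process (PCA dimension reduction followed by an SVM, scored with 5-fold cross-validation); the random matrix stands in for the 64 clickstream attributes, since the actual data is not public:

```python
# Synthetic stand-in for the 64-attribute clickstream user profiles.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))    # 64 clickstream attributes per user
y = rng.integers(0, 2, size=1000)  # e.g., a binary gender label

pipe = make_pipeline(PCA(n_components=10), SVC())
print(cross_val_score(pipe, X, y, cv=5).mean())  # 5-fold CV accuracy
```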

Automatic Quality Evaluation with Completeness and Succinctness for Text Summarization (완전성과 간결성을 고려한 텍스트 요약 품질의 자동 평가 기법)

  • Ko, Eunjung;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.24 no.2 / pp.125-148 / 2018
  • Recently, as demand for big data analysis has grown, cases of analyzing unstructured data and using the results are also increasing. Among the various types of unstructured data, text is used to communicate information in almost all fields, and many analysts are interested in it because the amount of data is very large and it is relatively easy to collect compared with other unstructured and structured data. Among the many applications of text analysis, active research areas include document classification, which assigns documents to predetermined categories; topic modeling, which extracts major topics from a large number of documents; sentiment analysis or opinion mining, which identifies emotions or opinions contained in texts; and text summarization, which condenses the main contents of one or several documents. The text summarization technique in particular is actively applied in business through news summary services, privacy policy summary services, etc. Much academic research has also been done on the extractive approach, which selectively provides the main elements of a document, and the abstractive approach, which extracts elements of a document and composes new sentences by combining them. However, techniques for evaluating the quality of automatically summarized documents have not progressed as far as automatic text summarization itself. Most existing studies of summarization quality evaluation manually summarized documents, used these as reference documents, and measured the similarity between the automatic summary and the reference document. Specifically, automatic summarization is performed on the full text by various techniques, and quality is measured by comparison with the reference document, an ideal summary. Reference documents are most commonly produced by manual summarization, in which a person creates an ideal summary by hand. Because this method requires human intervention in preparing the summary, it takes much time and cost, and the evaluation result may differ depending on who writes the summary. To overcome these limitations, attempts have been made to measure summary quality without human intervention; a representative recent method reduces the size of the full text and measures the similarity between the reduced full text and the automatic summary. In this method, the more a frequent term in the full text appears in the summary, the better the quality of the summary. However, since summarization essentially means condensing content as much as possible while minimizing omissions, a summary judged "good" by frequency alone is not always a good summary in this essential sense. To overcome the limitations of these previous evaluation studies, this study proposes an automatic quality evaluation method for text summarization based on the essential meaning of summarization. Specifically, succinctness is defined as an element indicating how little content is duplicated among the sentences of the summary, and completeness as an element indicating how little of the original content is missing from the summary. To evaluate the practical applicability of the proposed methodology, 29,671 sentences were extracted from TripAdvisor hotel reviews, the reviews were summarized for each hotel, and experiments evaluating summary quality according to the proposed methodology were conducted. We also provide a way to integrate completeness and succinctness, which are in a trade-off relationship, into an F-score, and propose a method to perform optimal summarization by changing the sentence-similarity threshold.
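A minimal sketch of the proposed F-score combination of the two criteria; the two input scores are placeholders for values that would come from sentence-similarity computations over a real summary:

```python
# Harmonic mean of completeness and succinctness, combined as an F-score.
def f_score(completeness: float, succinctness: float) -> float:
    if completeness + succinctness == 0:
        return 0.0
    # Penalizes summaries that are strong on only one criterion.
    return 2 * completeness * succinctness / (completeness + succinctness)

print(f_score(completeness=0.9, succinctness=0.6))  # -> 0.72
```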

Ensemble of Nested Dichotomies for Activity Recognition Using Accelerometer Data on Smartphone (Ensemble of Nested Dichotomies 기법을 이용한 스마트폰 가속도 센서 데이터 기반의 동작 인지)

  • Ha, Eu Tteum;Kim, Jeongmin;Ryu, Kwang Ryel
    • Journal of Intelligence and Information Systems / v.19 no.4 / pp.123-132 / 2013
  • As smartphones are equipped with various sensors such as the accelerometer, GPS, gravity sensor, gyroscope, ambient light sensor, and proximity sensor, there has been much research on using these sensors to create valuable applications. Human activity recognition is one such application, motivated by welfare applications such as support for the elderly, measurement of calorie consumption, and analysis of lifestyles and exercise patterns. One challenge in using smartphone sensors for activity recognition is that the number of sensors should be minimized to save battery power. When the number of sensors is restricted, it is difficult to build a highly accurate activity recognizer because subtly different activities are hard to distinguish with limited information, and the difficulty grows as the number of activity classes to distinguish increases. In this paper, we show that a fairly accurate classifier distinguishing ten different activities can be built using data from only a single sensor, the smartphone accelerometer. Our approach to this ten-class problem is the ensemble of nested dichotomies (END) method, which transforms a multi-class problem into multiple two-class problems. END builds a committee of binary classifiers in a nested fashion using a binary tree. At the root, the set of all classes is split into two subsets by a binary classifier; at each child node, a subset of classes is again split into two smaller subsets by another binary classifier. Continuing in this way, we obtain a binary tree in which each leaf contains a single class; this tree is a nested dichotomy that can make multi-class predictions. Depending on how the classes are split at each node, different trees are obtained, and because some classes are correlated, a particular tree may perform better than others; however, the best tree can hardly be identified without deep domain knowledge. END copes with this by building multiple dichotomy trees randomly during learning and combining their predictions during classification, and it is generally known to perform well even when the base learner cannot model complex decision boundaries. As the base classifier at each node of a dichotomy, we used another ensemble classifier, the random forest. A random forest is built by repeatedly generating decision trees, each with a different random subset of features, on bootstrap samples; by combining bagging with random feature-subset selection, it enjoys more diverse ensemble members than simple bagging. Overall, our ensemble of nested dichotomies can thus be seen as a committee of committees of decision trees that handles a multi-class problem with high accuracy. The ten activity classes distinguished in this paper are 'Sitting', 'Standing', 'Walking', 'Running', 'Walking Uphill', 'Walking Downhill', 'Running Uphill', 'Running Downhill', 'Falling', and 'Hobbling'. The features used for classification include not only the magnitude of the acceleration vector at each time point but also the maximum, minimum, and standard deviation of the vector magnitude within a time window covering the last 2 seconds. For experiments comparing the performance of END with other methods, accelerometer data were collected every 0.1 seconds for 2 minutes per activity from 5 volunteers. Of the 5,900 ($=5{\times}(60{\times}2-2)/0.1$) data points collected per activity (the data for the first 2 seconds are discarded because they lack a full time window), 4,700 were used for training and the rest for testing. Although 'Walking Uphill' is often confused with other similar activities, END classified all ten activities with a fairly high accuracy of 98.4%, while a decision tree, a k-nearest neighbor classifier, and a one-versus-rest support vector machine achieved 97.6%, 96.5%, and 97.6%, respectively.
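A minimal sketch of an ensemble of nested dichotomies with random forests as the node classifiers, assuming scikit-learn; the random class splits and the synthetic data are illustrative, not the paper's implementation:

```python
# Toy END: random nested dichotomies whose class probabilities are averaged.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class NestedDichotomy:
    """One dichotomy tree: randomly splits the class set in two at each node."""
    def __init__(self, rng):
        self.rng = rng

    def fit(self, X, y):
        self.classes = np.unique(y)
        if len(self.classes) > 1:
            left = set(self.rng.permutation(self.classes)[: len(self.classes) // 2])
            mask = np.isin(y, list(left))
            self.clf = RandomForestClassifier(n_estimators=50, random_state=0)
            self.clf.fit(X, mask.astype(int))   # binary: left subset vs. rest
            self.left = NestedDichotomy(self.rng).fit(X[mask], y[mask])
            self.right = NestedDichotomy(self.rng).fit(X[~mask], y[~mask])
        return self

    def proba(self, x):
        if len(self.classes) == 1:
            return {self.classes[0]: 1.0}
        p_left = self.clf.predict_proba(x.reshape(1, -1))[0, 1]
        out = {c: p_left * p for c, p in self.left.proba(x).items()}
        out.update({c: (1 - p_left) * p for c, p in self.right.proba(x).items()})
        return out

class END:
    """Averages class probabilities over several random dichotomy trees."""
    def __init__(self, n_trees=10, seed=42):
        self.n_trees, self.rng = n_trees, np.random.default_rng(seed)

    def fit(self, X, y):
        self.trees = [NestedDichotomy(self.rng).fit(X, y) for _ in range(self.n_trees)]
        return self

    def predict(self, X):
        preds = []
        for x in X:
            total = {}
            for t in self.trees:
                for c, p in t.proba(x).items():
                    total[c] = total.get(c, 0.0) + p
            preds.append(max(total, key=total.get))
        return np.array(preds)

# Synthetic 4-class demo in place of the ten activity classes:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(i, 1.0, (60, 3)) for i in range(4)])
y = np.repeat(np.arange(4), 60)
print((END(n_trees=5).fit(X, y).predict(X) == y).mean())
```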

A Deep Learning Based Approach to Recognizing Accompanying Status of Smartphone Users Using Multimodal Data (스마트폰 다종 데이터를 활용한 딥러닝 기반의 사용자 동행 상태 인식)

  • Kim, Kilho;Choi, Sangwoo;Chae, Moon-jung;Park, Heewoong;Lee, Jaehong;Park, Jonghun
    • Journal of Intelligence and Information Systems / v.25 no.1 / pp.163-177 / 2019
  • As smartphones have become widely used, human activity recognition (HAR) tasks that recognize the personal activities of smartphone users from multimodal data have been actively studied. The research area is expanding from recognizing the simple body movements of an individual user to recognizing low-level and high-level behavior. However, HAR tasks for recognizing interaction with other people, such as whether the user is accompanying or communicating with someone else, have received less attention so far, and previous research on interaction behavior has usually depended on audio, Bluetooth, and Wi-Fi sensors, which are vulnerable to privacy issues and require much time to collect enough data. Physical sensors such as the accelerometer, magnetic field sensor, and gyroscope, by contrast, are less vulnerable to privacy issues and can collect a large amount of data within a short time. In this paper, we propose a deep-learning-based method for detecting accompanying status using only multimodal physical sensor data (accelerometer, magnetic field, and gyroscope). Accompanying status is defined as a redefined subset of user interaction behavior: whether the user is accompanied by an acquaintance at close distance and whether the user is actively communicating with that acquaintance. We propose a framework based on convolutional neural networks (CNN) and long short-term memory (LSTM) recurrent networks for classifying accompaniment and conversation. First, we introduce a data preprocessing method consisting of time synchronization of the multimodal data from different physical sensors, data normalization, and sequence-data generation. Nearest-neighbor interpolation synchronizes the timestamps of the data collected from different sensors, normalization is performed on each x, y, z axis value, and sequence data are generated with a sliding window. The sequence data then become the input to the CNN, which extracts feature maps representing local dependencies in the original sequence; the CNN consists of 3 convolutional layers and has no pooling layer, so as to preserve the temporal information of the sequence. Next, the LSTM recurrent networks receive the feature maps, learn long-term dependencies from them, and extract features; they consist of two layers, each with 128 cells. Finally, the extracted features are classified by a softmax classifier. The loss function is cross entropy, and the weights are randomly initialized from a normal distribution with mean 0 and standard deviation 0.1. The model is trained with the adaptive moment estimation (Adam) optimizer with a mini-batch size of 128; dropout is applied to the inputs of the LSTM layers to prevent overfitting, and the learning rate starts at 0.001 and decays exponentially by a factor of 0.99 at the end of each training epoch. An Android smartphone application was developed and released to collect data from a total of 18 subjects. On these data, the model classified accompaniment and conversation with accuracies of 98.74% and 98.83%, respectively; both the F1 score and accuracy of the model exceeded those of a majority-vote classifier, a support vector machine, and a deep recurrent neural network. In future research, we will focus on more rigorous multimodal sensor synchronization methods that minimize timestamp differences, and we will further study transfer learning methods that let models trained on the training data transfer to evaluation data following a different distribution. We expect this to yield a model with robust recognition performance against data changes not considered during learning.
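A hedged Keras sketch of the architecture as described (three convolutional layers without pooling, dropout on the LSTM input, two 128-cell LSTM layers, a softmax head, and Adam at 0.001); the window length, channel count, and filter sizes are assumptions the abstract does not specify:

```python
# Assumed shapes: 9 channels = 3 sensors x 3 axes; window length is a guess.
from tensorflow import keras
from tensorflow.keras import layers

WINDOW, CHANNELS, N_CLASSES = 100, 9, 2

model = keras.Sequential([
    keras.Input(shape=(WINDOW, CHANNELS)),
    layers.Conv1D(64, 5, activation="relu", padding="same"),
    layers.Conv1D(64, 5, activation="relu", padding="same"),
    layers.Conv1D(64, 5, activation="relu", padding="same"),  # no pooling
    layers.Dropout(0.5),                  # dropout on the LSTM input
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(128),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```

The per-epoch 0.99 exponential learning-rate decay mentioned above could be added with a keras.callbacks.LearningRateScheduler.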

A study on the classification of research topics based on COVID-19 academic research using Topic modeling (토픽모델링을 활용한 COVID-19 학술 연구 기반 연구 주제 분류에 관한 연구)

  • Yoo, So-yeon;Lim, Gyoo-gun
    • Journal of Intelligence and Information Systems / v.28 no.1 / pp.155-174 / 2022
  • From January 2020 to October 2021, more than 500,000 academic studies related to COVID-19 (the disease caused by the SARS-CoV-2 coronavirus) were published. The rapid increase in the number of COVID-19 papers places time and technical constraints on healthcare professionals and policy makers seeking to find important research quickly. In this study, we therefore propose a method of extracting useful information from the text of this extensive literature using the LDA and Word2vec algorithms: papers related to searched keywords were extracted from the COVID-19 literature, and their detailed topics were identified. The data came from the CORD-19 dataset on Kaggle, a free academic resource prepared by major research groups and the White House in response to the COVID-19 pandemic and updated weekly. The research method has two main parts. First, 41,062 articles were collected through data filtering and preprocessing of the abstracts of 47,110 academic papers with full text. The number of COVID-19 publications per year was analyzed through exploratory data analysis with a Python program, and the top 10 journals with the most active research were identified; the LDA and Word2vec algorithms were then used to derive COVID-19 research topics, and similarity was measured after analyzing related words. Second, papers containing 'vaccine' and 'treatment' were extracted from the topics derived from all papers: 4,555 papers related to 'vaccine' and 5,971 related to 'treatment'. For each collected subset, detailed topics were analyzed using the LDA and Word2vec algorithms, and a clustering method using PCA dimension reduction was applied, with the t-SNE algorithm visualizing groups of papers with similar themes. A noteworthy result is that topics not derived in the topic modeling of the full paper set did emerge in the topic modeling of each keyword-specific subset. For example, topic modeling of the 'vaccine' papers extracted a new topic, Topic 05 'neutralizing antibodies'; a neutralizing antibody protects cells from infection when a virus enters the body and plays an important role in the development of therapeutic agents and vaccines. Likewise, topic extraction from the 'treatment' papers discovered a new topic, Topic 05 'cytokine'; a cytokine storm occurs when the body's immune cells attack normal cells instead of defending against the invader. Hidden topics that could not be found across the whole corpus were thus revealed by classifying papers by keyword and performing topic modeling on each subset. In this study, we proposed extracting topics from a large body of literature with the LDA algorithm and extracting similar words with the skip-gram variant of Word2vec, which predicts surrounding words from a central word. Combining the LDA and Word2vec models aims at better performance by relating documents both to LDA topics and to Word2vec similarity structure. In addition, as a clustering method using PCA dimension reduction, we presented an intuitive way to classify documents with similar themes into groups, structured with the t-SNE technique. At a time when researchers' efforts to overcome COVID-19 cannot keep up with the rapid publication of related academic papers, we hope this work saves the precious time and effort of healthcare professionals and policy makers and helps them rapidly gain new insights; it is also expected to serve as basic data for researchers exploring new research directions.
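A minimal gensim sketch of the LDA-plus-skip-gram combination described above; the three-document corpus is a toy stand-in for the CORD-19 abstracts:

```python
# Toy corpus; CORD-19 preprocessing (tokenization, filtering) is omitted.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

docs = [["vaccine", "antibody", "clinical", "trial"],
        ["treatment", "cytokine", "storm", "therapy"],
        ["vaccine", "dose", "neutralizing", "antibody"]]

dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)
print(lda.print_topics())

# sg=1 selects the skip-gram architecture mentioned in the abstract.
w2v = Word2Vec(docs, vector_size=16, min_count=1, sg=1, seed=0)
print(w2v.wv.most_similar("vaccine", topn=3))
```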