• Title/Summary/Keyword: TextMining

Analyzing Research Trends on Research Support Services Using Topic Modeling (토픽모델링을 활용한 국내외 연구지원서비스 연구동향 분석)

  • Ji Soo Kim;Yoo Kyung Jeong
    • Journal of the Korean Society for Information Management / v.41 no.3 / pp.309-330 / 2024
  • This study aims to identify and compare the primary research topics in domestic and international research support services through topic modeling. The analysis revealed 12 major topics in domestic studies and 15 in international ones. The findings highlight the need for in-depth research on digital technology in open access, data management, research data management in university libraries, and digital research support services. Furthermore, the need for further research has been identified to analyze specific types of digital research support services and to explore the evolving role of information professionals in research data management. This study is significant in that it comprehensively analyzes existing research and provides guidance for future research directions.

Analysis of Research Trends in Data Curation Using Text Mining Techniques (텍스트 마이닝을 활용한 국외 데이터 큐레이션 연구 동향 분석)

  • Jaeeun Choi
    • Journal of the Korean Society for Information Management / v.41 no.3 / pp.85-107 / 2024
  • This study analyzes trends in data curation research. A total of 1,849 scholarly records were extracted from Scopus and WoS, with 1,797 papers selected after removing duplicates. Titles, keywords, and abstracts were analyzed through keyword frequency analysis, LDA topic modeling, and network analysis. Frequent keywords like 'research' and 'information' suggest that data curation is widely applied in medical research, biomedical research, data management, and infrastructure. LDA modeling identified five main topics: improving medical data quality, enhancing big data management, managing scientific data and repositories, annotating and modeling medical data, and gene/protein database research. Network analysis showed that 'analysis' was central in global discussions, while 'gene' and 'system' were locally central. These findings highlight the importance of data curation in various research areas.
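
The record-cleaning and keyword-frequency steps described above can be sketched roughly as follows. This is a minimal illustration, not the author's procedure: it assumes duplicates are detected by a normalized title, and the record layout (`title`, `keywords`) and the toy records are hypothetical.

```python
import re
from collections import Counter

def normalize_title(title):
    """Lowercase a title and collapse punctuation so near-identical
    records exported from different databases compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def deduplicate(records):
    """Keep the first record seen for each normalized title."""
    seen = set()
    unique = []
    for record in records:
        key = normalize_title(record["title"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

def keyword_frequency(records):
    """Count how many records mention each keyword (case-insensitive)."""
    counts = Counter()
    for record in records:
        counts.update({kw.lower() for kw in record["keywords"]})
    return counts

# Hypothetical toy records; the study's data came from Scopus and WoS.
records = [
    {"title": "Data Curation in Biomedical Research", "keywords": ["Research", "Gene"]},
    {"title": "Data curation in biomedical research.", "keywords": ["research", "gene"]},
    {"title": "Managing Scientific Data Repositories", "keywords": ["research", "information"]},
]

unique_records = deduplicate(records)
frequency = keyword_frequency(unique_records)
print(len(unique_records), frequency["research"])
```

Real bibliographic exports usually need stronger matching (DOI, author, year), so title normalization alone should be treated as a simplifying assumption.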

Research Trends in e-commerce Using Topic Modeling: Focusing on SCOPUS Database (토픽 모델링을 활용한 e-commerce 연구 동향: SCOPUS DB 데이터를 중심으로)

  • Tae-Gu Kang
    • Journal of Industrial Convergence / v.22 no.10 / pp.1-9 / 2024
  • E-commerce has emerged as a key economic driver in the digital age, and the growing importance of the e-commerce market has led to rapid expansion in related research areas. This paper analyzes research trends on e-commerce from 1996, when e-commerce emerged and research on it began, to the present day. To this end, we applied LDA topic modeling in R to records retrieved from SCOPUS, an international academic database, using the core keyword "e-commerce", and we conducted a validity test on the number of topics and an analysis of the predictive value of the topic model. The topic analysis showed that ecommerce, model, study, data, and online were among the important topic keywords; logistics was also found to be important. In the rapidly changing and complex e-commerce market environment, survival depends on responding to the diversification of business models and establishing a stable revenue structure. As continued growth of the e-commerce market is predicted, the results of this study can serve as basic data for developing countermeasures and strategies to enter the e-commerce market and expand business.

Application of Domain-specific Thesaurus to Construction Documents based on Flow Margin of Semantic Similarity

  • Youmin PARK;Seonghyeon MOON;Jinwoo KIM;Seokho CHI
    • International Conference on Construction Engineering and Project Management / 2024.07a / pp.375-382 / 2024
  • Large Language Models (LLMs) still encounter challenges in comprehending domain-specific expressions within construction documents. Analogous to humans acquiring unfamiliar expressions from dictionaries, language models could assimilate domain-specific expressions through the use of a thesaurus. Numerous prior studies have developed construction thesauri; however, a practical issue arises in effectively leveraging these resources for instructing language models. Given that the thesaurus primarily outlines relationships between terms without indicating their relative importance, language models may struggle in discerning which terms to retain or replace. This research aims to establish a robust framework for guiding language models using the information from the thesaurus. For instance, a term would be associated with a list of similar terms while also being included in the lists of other related terms. The relative significance among terms could be ascertained by employing similarity scores normalized according to relevance ranks. Consequently, a term exhibiting a positive margin of normalized similarity scores (termed a pivot term) could semantically replace other related terms, thereby enabling LLMs to comprehend domain-specific terms through these pivotal terms. The outcome of this research presents a practical methodology for utilizing domain-specific thesauri to train LLMs and analyze construction documents. Ongoing evaluation involves validating the accuracy of the thesaurus-applied LLM (e.g., S-BERT) in identifying similarities within construction specification provisions. This outcome holds potential for the construction industry by enhancing LLMs' understanding of construction documents and subsequently improving text mining performance and project management efficiency.

Clickstream Big Data Mining for Demographics based Digital Marketing (인구통계특성 기반 디지털 마케팅을 위한 클릭스트림 빅데이터 마이닝)

  • Park, Jiae;Cho, Yoonho
    • Journal of Intelligence and Information Systems / v.22 no.3 / pp.143-163 / 2016
  • The demographics of Internet users are the most basic and important sources for target marketing or personalized advertisements on digital marketing channels, which include email, mobile, and social media. However, it has gradually become difficult to collect the demographics of Internet users because their activities are anonymous in many cases. Although a marketing department can obtain demographics through online or offline surveys, these approaches are expensive, slow, and likely to include false statements. Clickstream data is the record an Internet user leaves behind while visiting websites. As the user clicks anywhere on a webpage, the activity is logged in semi-structured website log files. Such data allows us to see what pages users visited, how long they stayed there, how often they visited, when they usually visited, which sites they prefer, what keywords they used to find the site, whether they made a purchase, and so forth. For this reason, some researchers have tried to infer the demographics of Internet users from their clickstream data. They derived various independent variables likely to be correlated with the demographics. The variables include search keywords, frequency and intensity by time, day, and month, variety of websites visited, text information from web pages visited, etc. The demographic attributes to predict also vary from paper to paper, and cover gender, age, job, location, income, education, marital status, and presence of children. A variety of data mining methods, such as LSA, SVM, decision tree, neural network, logistic regression, and k-nearest neighbors, were used to build prediction models. However, this body of research has not yet identified which data mining method is appropriate for predicting each demographic variable. Moreover, the independent variables studied so far need to be reviewed, combined as needed, and evaluated in order to build the best prediction model.
The objective of this study is to choose the clickstream attributes most likely to be correlated with the demographics from the results of previous research, and then to identify which data mining method is best suited to predicting each demographic attribute. Among the demographic attributes, this paper focuses on predicting gender, age, marital status, residence, and job. From the results of previous research, 64 clickstream attributes are applied to predict the demographic attributes. The overall process of predictive model building is composed of four steps. In the first step, we create user profiles which include 64 clickstream attributes and 5 demographic attributes. The second step performs dimension reduction of the clickstream variables to address the curse of dimensionality and the overfitting problem. We utilize three approaches, based on decision tree, PCA, and cluster analysis. In the third step, we build alternative predictive models for each demographic variable. SVM, neural network, and logistic regression are used for modeling. The last step evaluates the alternative models in terms of accuracy and selects the best model. For the experiments, we used clickstream data covering 5 demographics and 16,962,705 online activities for 5,000 Internet users. IBM SPSS Modeler 17.0 was used for our prediction process, and 5-fold cross validation was conducted to enhance the reliability of our experiments. The experimental results verify that there is a specific data mining method well suited to each demographic variable. For example, age prediction performs best when using decision tree based dimension reduction and a neural network, whereas the prediction of gender and marital status is most accurate when applying SVM without dimension reduction.
We conclude that the online behaviors of Internet users, captured through clickstream data analysis, can be used effectively to predict their demographics and can thereby be applied to digital marketing.
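
The last two steps of the process above (building alternative models, then selecting the most accurate one under 5-fold cross validation) can be illustrated with a small sketch. The study used IBM SPSS Modeler with SVM, neural network, and logistic regression; the code below does not reproduce those. It substitutes two deliberately simple stand-in models (a majority-class baseline and a nearest-centroid classifier) and entirely made-up user profiles, purely to show the evaluation loop.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(fit, predict, features, labels, k=5):
    """Mean held-out accuracy of one candidate model over k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(features), k):
        held = set(held_out)
        train = [i for i in range(len(features)) if i not in held]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, features[i]) == labels[i] for i in held_out)
        accuracies.append(hits / len(held_out))
    return sum(accuracies) / k

# Stand-in candidates (NOT the SVM / neural network / logistic regression of the study).
def fit_majority(features, labels):
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, x):
    return model

def fit_centroids(features, labels):
    groups = {}
    for x, y in zip(features, labels):
        groups.setdefault(y, []).append(x)
    return {y: [sum(col) / len(rows) for col in zip(*rows)] for y, rows in groups.items()}

def predict_centroid(model, x):
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])))

# Hypothetical user profiles: two invented clickstream attributes and one label.
features = [(i, (2 * i) % 5) for i in range(10)] + [(100 + i, (2 * i) % 5) for i in range(10)]
labels = ["single"] * 10 + ["married"] * 10

candidates = {
    "majority": (fit_majority, predict_majority),
    "nearest_centroid": (fit_centroids, predict_centroid),
}
scores = {name: cross_validate(fit, predict, features, labels)
          for name, (fit, predict) in candidates.items()}
best_model = max(scores, key=scores.get)
print(scores, best_model)
```

Running the same loop once per demographic attribute would give one selected model per attribute, which is the shape of the result the abstract reports.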

Definition and Division in Intelligent Service Facility for Integrating Management (지능화시설의 통합운영관리를 위한 정의 및 구분에 관한 연구)

  • PARK, Jeong-Woo;YIM, Du-Hyun;NAM, Kwang-Woo;KIM, Jin-Young
    • Journal of the Korean Association of Geographic Information Studies / v.19 no.4 / pp.52-62 / 2016
  • A Smart City is a form of urban development for solving complex problems that provides convenience and safety for citizens, and it is a blueprint for future cities. In 2008, the Korean government defined the construction, management, and government support of U-Cities in legislation, the Act on the Construction, Etc. of Ubiquitous Cities (Ubiquitous City Act), which included definitions of the terms used in the act. In addition, the Minister of Land, Infrastructure and Transport has established a "ubiquitous city master plan" in consideration of this legislation. The concept of U-Cities is complex because it mixes informatization and urban planning. Because of this complexity, the foundation of relevant regulations is inadequate, which impedes the establishment and implementation of practical plans. Smart City intelligent service facilities are not easy to define and classify, because the technology is changing rapidly and includes various devices for gathering and expressing information. The purpose of this study is to complement the legal definition of the intelligent service facility, which is necessary for integrated management and operation. The related laws and regulations on U-City were analyzed using text-mining techniques to identify insufficient legal definitions of intelligent service facilities. Using data gathered from interviews with officials responsible for constructing U-Cities, this study also identified problems that arise when implementing intelligent service facilities at the field level. The results should contribute to more efficient management, which is the foundation for integrated utilization between departments, and should provide a clear concept for establishing the five-year renewable plans for U-Cities.

International Research Trend on Mountainous Sediment-related Disasters Induced by Earthquakes (지진 유발 산지토사재해 관련 국외 연구동향 분석)

  • Lee, Sang-In;Seo, Jung-Il;Kim, Jin-Hak;Ryu, Dong-Seop;Seo, Jun-Pyo;Kim, Dong-Yeob;Lee, Chang-Woo
    • Journal of Korean Society of Forest Science / v.106 no.4 / pp.431-440 / 2017
  • The 2016 Gyeongju Earthquake ($M_L$ 5.8), which occurred on September 12, 2016, and the 2017 Pohang Earthquake ($M_L$ 5.4), which occurred on November 15, 2017, caused unprecedented damage in South Korea. It is therefore necessary to establish basic data on earthquake-induced mountainous sediment-related disasters worldwide. In this study, we analyzed previous international studies on earthquake-induced mountainous sediment-related disasters, classified research areas according to research themes using text mining and co-word analysis in the VOSviewer program, and finally examined spatio-temporal research trends by research area. The results showed that related research has increased rapidly since 2005, which seems to have been driven by recent large-scale earthquakes in China, Taiwan, and Japan. In addition, research on mountainous sediment-related disasters induced by earthquakes was classified into four subjects: (i) mechanisms of disaster occurrence; (ii) rainfall parameters controlling disaster occurrence; (iii) prediction of potential disaster areas using aerial and satellite photographs; and (iv) disaster risk mapping through the modeling of disaster occurrence. These research areas are considered to be strongly correlated with each other. From the threshold year (i.e., 2012-2013), when the cumulative number of research papers reached 50% of the total published since 1987, the proportions per unit year of all research areas increased. In particular, the proportion of research on the prediction of potential disaster areas using aerial and satellite photographs increased sharply compared to the other three research areas. These trends are attributable to the rapidly increasing number of research papers with study sites in China, and research papers with study sites in Taiwan, Japan, and the United States have also contributed to increases in all research areas.
The results could be used as basic data for setting future research directions on mountainous sediment-related disasters induced by earthquakes in South Korea.
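
The "threshold year" used above is the year in which the cumulative number of papers reaches 50% of the total. A minimal sketch of that calculation follows, assuming the threshold is taken as the first year whose running total meets or exceeds the given fraction. The yearly counts are invented for illustration and are not the study's data.

```python
from itertools import accumulate

def threshold_year(papers_per_year, fraction=0.5):
    """Return the first year in which the cumulative paper count reaches
    `fraction` of the total, or None if there are no papers."""
    years = sorted(papers_per_year)
    total = sum(papers_per_year.values())
    running_totals = accumulate(papers_per_year[year] for year in years)
    for year, cumulative in zip(years, running_totals):
        if total and cumulative >= fraction * total:
            return year
    return None

# Invented toy counts (year -> number of papers published that year).
toy_counts = {2001: 1, 2002: 3, 2003: 6, 2004: 10, 2005: 12, 2006: 8}
print(threshold_year(toy_counts))
```

With these toy counts the running totals are 1, 4, 10, 20, 32, 40, so the function returns the fourth year, where the total first reaches half of 40.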

Keyword Network Analysis for Technology Forecasting (기술예측을 위한 특허 키워드 네트워크 분석)

  • Choi, Jin-Ho;Kim, Hee-Su;Im, Nam-Gyu
    • Journal of Intelligence and Information Systems / v.17 no.4 / pp.227-240 / 2011
  • New concepts and ideas often result from extensive recombination of existing concepts or ideas. Both researchers and developers build on existing concepts and ideas in published papers or registered patents to develop new theories and technologies that in turn serve as a basis for further development. As the importance of patents increases, so does that of patent analysis. Patent analysis is largely divided into network-based and keyword-based analyses. The former lacks the ability to analyze information technology in detail, while the latter is unable to identify the relationships between such technologies. In order to overcome the limitations of network-based and keyword-based analyses, this study blends the two methods and proposes a keyword network based analysis methodology. In this study, we collected significant technology information from each patent related to Light Emitting Diodes (LEDs) through text mining, built a keyword network, and then executed a community network analysis on the collected data. The results of the analysis are as follows. First, the patent keyword network showed very low density and an exceptionally high clustering coefficient. Technically, density is obtained by dividing the number of ties in a network by the number of all possible ties. The value ranges between 0 and 1, with higher values indicating denser networks and lower values indicating sparser networks. In real-world networks, the density varies depending on the size of a network; increasing the size of a network generally leads to a decrease in the density. The clustering coefficient is a network-level measure that illustrates the tendency of nodes to cluster in densely interconnected modules. This measure reflects the small-world property, in which a network can be highly clustered even though it has a small average distance between nodes despite the large number of nodes.
Therefore, the low density of the patent keyword network means that nodes in the network are connected sporadically, and the high clustering coefficient shows that nodes in the network are closely connected to one another. Second, the cumulative degree distribution of the patent keyword network, like that of other knowledge networks such as citation networks or collaboration networks, followed a clear power-law distribution. A well-known mechanism behind this pattern is preferential attachment, whereby a node with more links is likely to attain further new links as the network evolves. Unlike normal distributions, the power-law distribution does not have a representative scale. This means that one cannot pick a representative or average value because there is always a considerable probability of finding much larger values. Networks with power-law distributions are therefore often referred to as scale-free networks. The presence of a heavy-tailed scale-free distribution represents the fundamental signature of an emergent collective behavior of the actors who contribute to forming the network. In our context, the more frequently a patent keyword is used, the more often it is selected by researchers and associated with other keywords or concepts to constitute and convey new patents or technologies. The evidence of a power-law distribution suggests that the preferential attachment mechanism is the origin of heavy-tailed distributions in a wide range of growing patent keyword networks. Third, we found that among keywords that flowed into a particular field, the vast majority of keywords with new links join existing keywords in the associated community in forming the concept of a new patent. This finding held for both the short-term (4-year) and long-term (10-year) analyses.
Furthermore, the keyword combination information derived from the methodology proposed in this study enables one to forecast which concepts will combine to form a new patent dimension and to refer to those concepts when developing a new patent.
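
The two network measures defined above can be computed on a toy keyword network as follows. Density follows the definition given in the abstract (ties divided by all possible ties). For the clustering coefficient, this sketch assumes the average of the local clustering coefficients; the abstract does not say which variant the authors used, and the keyword edges are hypothetical.

```python
from itertools import combinations

def build_graph(edges):
    """Build an undirected adjacency map from keyword pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def density(graph):
    """Number of ties divided by the number of all possible ties."""
    n = len(graph)
    ties = sum(len(neighbors) for neighbors in graph.values()) / 2
    possible = n * (n - 1) / 2
    return ties / possible if possible else 0.0

def local_clustering(graph, node):
    """Fraction of a node's neighbor pairs that are themselves linked."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(neighbors, 2) if b in graph[a])
    return linked / (k * (k - 1) / 2)

def average_clustering(graph):
    """Mean local clustering coefficient over all nodes."""
    return sum(local_clustering(graph, node) for node in graph) / len(graph)

# Hypothetical keyword network: two tight triangles joined by one bridge.
edges = [
    ("led", "chip"), ("chip", "substrate"), ("led", "substrate"),
    ("phosphor", "lens"), ("lens", "resin"), ("phosphor", "resin"),
    ("substrate", "phosphor"),
]
graph = build_graph(edges)
print(round(density(graph), 3), round(average_clustering(graph), 3))
```

Even in this tiny example the clustering coefficient (7/9) is well above the density (7/15). That gap widens as a network grows in loosely connected clusters, which is the pattern the abstract describes.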

Analysis of Trends in Education Policy of STEAM Using Text Mining: Comparative Analysis of Ministry of Education's Documents, Articles, and Abstract of Researches from 2009 to 2020 (텍스트 마이닝을 활용한 융합인재교육정책 동향 분석 -2009년~2020년 교육부보도, 언론보도, 학술지 초록 비교분석-)

  • You, Jungmin;Kim, Sung-Won
    • Journal of The Korean Association For Science Education / v.41 no.6 / pp.455-470 / 2021
  • This study examines changes in the keywords and topics of STEAM education from 2009 to 2020 to derive future development directions and educational implications. From the collected data, 42 Ministry of Education documents, 1,534 articles, and 880 research abstracts were selected as research subjects. Keyword analysis, keyword network analysis, and topic modeling were performed in Python for each stage of STEAM education policy. The analysis showed that the frequency and network of keywords related to STEAM education differed by medium across the policy stages. The differences in the keywords and topics emphasized by each medium confirmed that the media differ in their interest in STEAM education policy. Most of the topics in the Ministry of Education's documents were found to correspond to topics derived from articles. The implications for the development direction of STEAM education derived from the results of this study are as follows. First, STEAM education needs to consider ways to connect multiple topics, including the humanities. Second, since the media differ in their interest in STEAM education policy, a cooperative development direction should be sought based on an understanding of these differences. Third, the Ministry of Education needs to support core competency reinforcement and convergence literacy for nurturing future talent, which is the goal of STEAM education, and the media need to work to increase the public's understanding of STEAM education. Lastly, it is necessary to continuously analyze the themes that appear in the evaluation process and to revise STEAM education policy accordingly.

Analysis of Research Trends on Mountain Streams in the Republic of Korea: Comparison to International Research Trends (산지하천을 대상으로 한 국내 연구동향 분석: 국제 연구동향과의 비교)

  • Lee, Sang In;Seo, Jung Il;Lee, Yohan;Kim, Suk Woo;Chun, Kun Woo
    • Korean Journal of Environment and Ecology / v.33 no.2 / pp.216-227 / 2019
  • The purpose of this study is to propose a rational mountain stream management strategy that considers the natural conditions and social needs of the Republic of Korea. We reviewed domestic and overseas studies related to mountain streams, identified the study areas by text mining and co-word analysis using the VOSviewer program, and then analyzed the spatial and temporal study trends and topics of each study area. The results showed that domestic studies on mountain streams are still at an initial stage compared to overseas studies. Overseas studies on mountain streams can be classified into four groups: (i) habitat and species composition of fish and invertebrates, (ii) hydrological phenomena and nutrient migration, (iii) transport of sediment and organic materials and the relevant morphological changes caused by runoff flows, and (iv) plant species composition in mountain streams. Of these study subjects, domestic studies belonging to group (i) mainly focused on macroinvertebrates, while domestic studies belonging to group (iii) regarded the transport of sediment and organic materials not as an ecological disturbance but as a source of sediment-related disasters. We then analyzed the share of each research group among all papers by period and country. The results showed that overseas studies belonging to groups (iii) and (iv) have increased over time, and that the increase was mostly due to studies in the United States, Brazil, Canada, and China. On the other hand, domestic studies belonging to groups (i) and (iii) increased somewhat over time, but there was a slight lack of correlation between the two subjects. Therefore, hybrid studies that address this gap are needed in the future.
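
Co-word analysis, as used above, starts from counts of how often two keywords appear in the same paper. The sketch below shows only that counting step, under the assumption that each paper is reduced to a list of keywords. It does not reproduce VOSviewer's normalization or clustering, and the keyword lists are hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_word_counts(keyword_lists):
    """Count, for every keyword pair, the number of papers containing both."""
    counts = Counter()
    for keywords in keyword_lists:
        unique = sorted({kw.lower() for kw in keywords})
        counts.update(combinations(unique, 2))
    return counts

# Hypothetical keyword lists, one per paper.
papers = [
    ["sediment transport", "runoff", "morphology"],
    ["Sediment transport", "runoff"],
    ["macroinvertebrates", "habitat"],
    ["runoff", "nutrient"],
]

pairs = co_word_counts(papers)
print(pairs[("runoff", "sediment transport")])
```

Sorting each paper's keywords before pairing keeps every pair in one canonical order, so ("runoff", "sediment transport") and its reverse are never counted separately.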