• Title/Summary/Keyword: Data mining analysis


A Study on the Research Trends for Smart City using Topic Modeling (토픽 모델링을 활용한 스마트시티 연구동향 분석)

  • Park, Keon Chul;Lee, Chi Hyung
    • Journal of Internet Computing and Services / v.20 no.3 / pp.119-128 / 2019
  • This study analyzes research trends on Smart City and presents implications for policy makers, industry professionals, and researchers. Cities around the globe have undergone rapid urbanization and a consequent dramatic increase in urban dwellings over the past few decades, and they face many urban problems in areas such as transportation, environment, and housing. They are hurrying to introduce Smart City initiatives in pursuit of the common goal of solving these urban problems and improving their residents' quality of life. However, the variety of conceptual approaches to the smart city creates uncertainty in setting policy goals and establishing a direction for implementation. The study collected 11,527 papers with "Smart City (Cities)" in the title from the Scopus and Springer databases, and then analyzed research status, topics, and trends based on abstracts and publication year using an LDA-based topic modeling approach. Research topics are classified into three categories (Services, Technologies, and User Perspective) and eight related topics. Of the eight topics, citizen-driven innovation is the most frequently mentioned. An additional topic network analysis reveals that data and privacy/security are the most prevalent topics affecting the others. This study is expected to help readers understand trends in Smart City research and anticipate future research.

Analysis of Technology Association Rules Between CPC Codes of the 'Internet of Things(IoT)' Patent (CPC 코드 기반 사물인터넷(IoT) 특허의 기술 연관성 규칙 분석)

  • Shim, Jaeruen
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.12 no.5 / pp.493-498 / 2019
  • This study analyzes the technology association rules between CPC codes of Internet of Things (IoT) patents, a core ICT-based technology of the Fourth Industrial Revolution. The association rules between CPC codes were extracted using R, an open-source tool for data mining. To this end, we analyzed, at the subclass level, the 369 patents with a complex CPC code out of the 605 IoT-related patents filed with the Patent Office up to July 2019. The rules with the highest support were [H04W $\rightarrow$ H04L] (18.2%), [H04L $\rightarrow$ H04W] (18.2%), [G06Q $\rightarrow$ H04L] (17.3%), [H04L $\rightarrow$ G06Q] (17.3%), [H04W $\rightarrow$ G06Q] (9.8%), [G06Q $\rightarrow$ H04W] (9.8%), [G06F $\rightarrow$ H04L] (7.9%), [H04L $\rightarrow$ G06F] (7.9%), [G06F $\rightarrow$ G06Q] (6.2%), and [G06Q $\rightarrow$ G06F] (6.2%). Analysis of the technology interconnection network shows that the core CPC codes in these association rules are G06Q and H04L. The results of this study can be used to predict future patent trends.
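The paired percentages in the abstract above follow from how support is defined: the support of A → B counts patents containing both codes, so it equals the support of B → A, while confidence generally differs by direction. The following is a minimal Python sketch of both measures. The study itself used R; the `rule_metrics` helper and the toy `patents` data are hypothetical and do not reproduce the paper's figures.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = share of transactions containing both items
    confidence = share of antecedent-containing transactions that
                 also contain the consequent
    """
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    ante = sum(1 for t in transactions if antecedent in t)
    support = both / n if n else 0.0
    confidence = both / ante if ante else 0.0
    return support, confidence


# Hypothetical toy data: each patent is the set of its CPC subclasses.
patents = [
    {"H04W", "H04L"},
    {"H04L", "G06Q"},
    {"H04W", "H04L", "G06Q"},
    {"G06F", "H04L"},
    {"G06Q"},
]

support_fwd, conf_fwd = rule_metrics(patents, "H04W", "H04L")
support_rev, conf_rev = rule_metrics(patents, "H04L", "H04W")
print(f"H04W -> H04L: support={support_fwd:.2f}, confidence={conf_fwd:.2f}")
print(f"H04L -> H04W: support={support_rev:.2f}, confidence={conf_rev:.2f}")
```

In the toy data both directions share the same support, but the confidence of H04W → H04L is higher because every patent with H04W also carries H04L.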

A study on the number of passengers using the subway stations in Seoul (데이터마이닝 기법을 이용한 서울시 지하철역 승차인원 예측)

  • Cho, Soojin;Kim, Bogyeong;Kim, Nahyun;Song, Jongwoo
    • The Korean Journal of Applied Statistics / v.32 no.1 / pp.111-128 / 2019
  • Subways are eco-friendly public transportation that can transport large numbers of passengers safely and quickly. Accurately predicting the number of passengers is necessary in order to increase public interest in the subway. This study groups stations on Lines 1 to 9 of the Seoul Metropolitan Subway using clustering analysis. We propose one final prediction model for all stations and three cluster-specific optimal prediction models. We found three groups among the 294 subway stations: the Group 1 area is industrial and commercial, the Group 2 area is residential and commercial, and the Group 3 area is residential. Various data mining techniques were applied to each group, and influential factors for demand prediction were derived. We use our model to predict the number of passengers for the 8 new stations that are part of the third extension of Seoul Metro Line 9, opened in October 2018. The estimated average number of passengers per hour ranges from 241 to 452, and the estimated maximum number of passengers per hour ranges from 969 to 1,515. We believe our analysis can help improve the efficiency of public transportation policy.

Research on Trends in International Research Cooperation through Analysis of International Research Cooperation Books (국내외 단행본 분석을 통한 국제연구협력 동향 연구)

  • Noh, Younghee;Kwak, Woojung
    • The Journal of the Korea Contents Association / v.22 no.6 / pp.35-44 / 2022
  • This study examines the characteristics of books published on the topic of international cooperation, what kinds of international cooperation research these books reflect, and what the main themes of international cooperation are. To this end, we carried out data construction, statistical analysis, and Textom-based text mining on domestic and foreign books on international research cooperation. The results show that interest in international research and international cooperation has been particularly high since the 2010s. The analysis also revealed interest in development, economy, technology, region, and relations, along with a desire to promote development. In addition, topics such as environment, trade, education, and society appeared; interest in international research cooperation centered on environment, trade, and education was high and was found to have a strong influence on society as a whole. This study is significant in that it can serve as basic research for confirming the characteristics of the national and public research institutes participating in international research cooperation, and in that it identifies trends in participation among relatively specific types of institutions.

Analysis on Results and Changes in Recent Forecasting of Earthquake and Space Technologies in Korea and Japan (한국과 일본의 지진재해 및 우주이용 기술예측에 대한 최근의 변화 분석)

  • Ahn, Eun-Young
    • Economic and Environmental Geology / v.55 no.4 / pp.421-428 / 2022
  • This study analyzes emerging earthquake and space use technologies from the latest Korean and Japanese scientific and technological foresight surveys, conducted in 2022 and 2019, respectively. Unlike the earthquake prediction and early warning technologies presented in the 2017 study, the emerging earthquake technologies in Korea's 2022 survey were described as earthquake/complex disaster information technology and a public data platform. Japan's 2019 survey presented many detailed future technologies, including large-scale earthquake prediction, induced earthquakes, national liquefaction risk, wide-scale stress measurement, and monitoring through Internet of Things (IoT) or artificial intelligence (AI) observation and analysis. The latest emerging space use technologies in Korea and Japan were presented in more detail as robotic mining technology for water/ice, Helium-3, and rare earth metals, and manned station technology that utilizes local resources on the Moon and Mars. The forecast realization years in the 2019 survey were 4-10 years later than those predicted in 2015, and the delay could grow because of the COVID-19 pandemic, the carbon neutrality declarations by Korea and Japan in 2020, and the Russo-Ukrainian War in 2022. More active research on earthquake and space technologies linked to information technology is nevertheless required.

Selective Word Embedding for Sentence Classification by Considering Information Gain and Word Similarity (문장 분류를 위한 정보 이득 및 유사도에 따른 단어 제거와 선택적 단어 임베딩 방안)

  • Lee, Min Seok;Yang, Seok Woo;Lee, Hong Joo
    • Journal of Intelligence and Information Systems / v.25 no.4 / pp.105-122 / 2019
  • Dimensionality reduction is one of the methods to handle big data in text mining. For dimensionality reduction, we should consider the density of data, which has a significant influence on the performance of sentence classification. It requires lots of computations for data of higher dimensions. Eventually, it can cause lots of computational cost and overfitting in the model. Thus, the dimension reduction process is necessary to improve the performance of the model. Diverse methods have been proposed from only lessening the noise of data like misspelling or informal text to including semantic and syntactic information. On top of it, the expression and selection of the text features have impacts on the performance of the classifier for sentence classification, which is one of the fields of Natural Language Processing. The common goal of dimension reduction is to find latent space that is representative of raw data from observation space. Existing methods utilize various algorithms for dimensionality reduction, such as feature extraction and feature selection. In addition to these algorithms, word embeddings, learning low-dimensional vector space representations of words, that can capture semantic and syntactic information from data are also utilized. For improving performance, recent studies have suggested methods that the word dictionary is modified according to the positive and negative score of pre-defined words. The basic idea of this study is that similar words have similar vector representations. Once the feature selection algorithm selects the words that are not important, we thought the words that are similar to the selected words also have no impacts on sentence classification. This study proposes two ways to achieve more accurate classification that conduct selective word elimination under specific regulations and construct word embedding based on Word2Vec embedding. 
To select words of low importance from the text, we use information gain to measure importance and cosine similarity to search for similar words. First, we eliminate words with comparatively low information gain from the raw text and build the word embedding. Second, we additionally select words that are similar to the words with low information gain and build the word embedding. The filtered text and word embeddings are then fed to two deep learning models: a Convolutional Neural Network and an Attention-Based Bidirectional LSTM. This study uses customer reviews of Kindle products on Amazon.com, IMDB, and Yelp as datasets and classifies each dataset using the deep learning models. Reviews that received more than five helpful votes and whose ratio of helpful votes exceeded 70% were classified as helpful reviews. Because Yelp shows only the number of helpful votes, we randomly sampled 100,000 reviews with more than five helpful votes from 750,000 reviews. Minimal preprocessing, such as removing numbers and special characters from the text, was applied to each dataset. To evaluate the proposed methods, we compared their performance with that of Word2Vec and GloVe word embeddings that used all the words. One of the proposed methods performed better than the embeddings with all the words, showing that removing unimportant words can improve performance. However, removing too many words lowered performance. Future research should consider more diverse preprocessing and an in-depth analysis of word co-occurrence for measuring similarity among words. In addition, we applied the proposed method only with Word2Vec. Other embedding methods such as GloVe, fastText, and ELMo could be combined with the proposed elimination methods, making it possible to identify effective combinations of word embedding and elimination methods.
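The two elimination steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: documents are treated as sets of tokens, information gain is computed on word presence, and the thresholds, the toy corpus, and the two-dimensional `vectors` (standing in for Word2Vec embeddings) are all hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(docs, labels, word):
    """Reduction in label entropy from knowing whether `word` is present."""
    present = [y for d, y in zip(docs, labels) if word in d]
    absent = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def words_to_remove(docs, labels, vectors, ig_threshold, sim_threshold):
    """Step 1: words with low information gain.
    Step 2: remaining words whose vector is close to any step-1 word."""
    vocab = set().union(*docs)
    low = {w for w in vocab if information_gain(docs, labels, w) < ig_threshold}
    similar = {
        w for w in vocab - low
        if any(cosine(vectors[w], vectors[s]) >= sim_threshold for s in low)
    }
    return low, similar


# Hypothetical toy corpus: token sets with binary helpfulness labels.
docs = [
    {"great", "the", "movie", "flick"},
    {"bad", "the", "film"},
    {"great", "a", "film"},
    {"bad", "a", "movie"},
]
labels = [1, 0, 1, 0]
# Hypothetical 2-D vectors standing in for learned word embeddings.
vectors = {
    "movie": [1.0, 0.1], "film": [0.9, 0.2], "flick": [0.95, 0.15],
    "great": [0.0, 1.0], "bad": [0.1, -1.0],
    "the": [-1.0, 0.0], "a": [-0.9, 0.1],
}

low, similar = words_to_remove(docs, labels, vectors,
                               ig_threshold=0.1, sim_threshold=0.95)
print(sorted(low), sorted(similar))
```

In this toy setup "flick" carries some information gain on its own, yet it is selected in the second step because its vector nearly coincides with those of the low-gain words "movie" and "film".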

An Intelligent Decision Support System for Selecting Promising Technologies for R&D based on Time-series Patent Analysis (R&D 기술 선정을 위한 시계열 특허 분석 기반 지능형 의사결정지원시스템)

  • Lee, Choongseok;Lee, Suk Joo;Choi, Byounggu
    • Journal of Intelligence and Information Systems / v.18 no.3 / pp.79-96 / 2012
  • As the pace of competition accelerates dramatically and the complexity of change grows, a wide range of research has been conducted to improve firms' short-term performance and enhance their long-term survival. In particular, researchers and practitioners have paid attention to identifying promising technologies that give a firm competitive advantage. Discovering a promising technology depends on how a firm evaluates the value of technologies, so many evaluation methods have been proposed. Approaches based on experts' opinions have been widely accepted for predicting the value of technologies. Although this approach provides in-depth analysis and ensures the validity of the results, it is usually costly and time-consuming and is limited to qualitative evaluation. Many studies have attempted to forecast the value of technology using patent information to overcome the limitations of the expert-opinion approach. Patent-based technology evaluation has served as a valuable assessment approach in technological forecasting because patents contain a full and practical description of a technology in a uniform structure. Furthermore, they provide information that is not divulged in any other source. Although the patent-based approach has contributed to our understanding of how to predict promising technologies, it has limitations because predictions are based on past patent information and the interpretations of patent analyses are not consistent. To fill this gap, this study proposes a technology forecasting methodology that integrates the patent information approach with an artificial intelligence method. The methodology consists of three modules: evaluation of how promising technologies are, implementation of a technology value prediction model, and recommendation of promising technologies. In the first module, the promise of technologies is evaluated from three different and complementary perspectives: impact, fusion, and diffusion. 
The impact of technologies refers to their influence on the development and improvement of future technologies, and is also clearly associated with their monetary value. The fusion of technologies denotes the extent to which a technology fuses different technologies, and represents the breadth of search underlying the technology. Fusion can be calculated per technology or per patent, so this study measures two types of fusion index: fusion index per technology and fusion index per patent. Finally, the diffusion of technologies denotes their degree of applicability across scientific and technological fields. In the same vein, diffusion index per technology and diffusion index per patent are both considered. In the second module, the technology value prediction model is implemented using an artificial intelligence method. This study uses the values of five indexes (i.e., impact index, fusion index per technology, fusion index per patent, diffusion index per technology, and diffusion index per patent) at different times (e.g., t-n, t-n-1, t-n-2, $\cdots$) as input variables. The output variables are the values of the five indexes at time t, which are used for learning. The learning method adopted in this study is the backpropagation algorithm. In the third module, this study recommends the final promising technologies based on the analytic hierarchy process (AHP). AHP provides the relative importance of each index, leading to a final promising index for each technology. The applicability of the proposed methodology is tested using U.S. patents in international patent class G06F (i.e., electronic digital data processing) from 2000 to 2008. The results show that the mean absolute error of predictions produced by the proposed methodology is lower than that produced by multiple regression analysis in the case of the fusion indexes. However, mean absolute error value of the proposed methodology is slightly higher than the value of multiple regression analysis. 
These unexpected results may be explained, in part, by the small number of patents. Since this study uses only patent data in class G06F, the sample is relatively small, which leads to incomplete learning for a complex artificial intelligence structure. In addition, fusion index per technology and impact index are found to be important criteria for predicting promising technologies. This study attempts to extend existing knowledge by proposing a new methodology for predicting technology value that integrates patent information analysis and an artificial neural network. By providing a quantitative prediction methodology, it helps managers who plan technology development and policy makers who implement technology policy. In addition, this study could help other researchers by providing a deeper understanding of the complex technological forecasting field.
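The third module's use of AHP can be illustrated with a short Python sketch. It assumes a hypothetical pairwise comparison matrix over three of the five indexes, derives weights with the row geometric mean method (a common approximation to AHP's principal eigenvector), and combines index values by a weighted sum. The matrix, the index values, and the weighted-sum aggregation are illustrative assumptions; the abstract does not give the authors' exact procedure.

```python
import math


def ahp_weights(pairwise):
    """Approximate AHP priority weights by the row geometric mean method.

    pairwise[i][j] states how much more important criterion i is than j;
    a reciprocal matrix satisfies pairwise[j][i] == 1 / pairwise[i][j].
    """
    size = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / size) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def promising_index(index_values, weights):
    """Weighted sum of index values (an assumed aggregation rule)."""
    return sum(v * w for v, w in zip(index_values, weights))


# Hypothetical comparisons among three criteria:
# impact index, fusion index per technology, diffusion index per technology.
pairwise = [
    [1.0, 2.0, 6.0],
    [1 / 2, 1.0, 3.0],
    [1 / 6, 1 / 3, 1.0],
]
weights = ahp_weights(pairwise)

# Hypothetical predicted index values at time t for two technologies.
predicted = {
    "tech_A": [0.8, 0.5, 0.2],
    "tech_B": [0.4, 0.9, 0.9],
}
ranking = sorted(predicted,
                 key=lambda name: promising_index(predicted[name], weights),
                 reverse=True)
print([round(w, 3) for w in weights], ranking)
```

Because the toy matrix is perfectly consistent, the geometric mean weights coincide with the eigenvector weights; for inconsistent judgments the two can differ slightly, and a consistency check would normally accompany the calculation.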

Mapping Categories of Heterogeneous Sources Using Text Analytics (텍스트 분석을 통한 이종 매체 카테고리 다중 매핑 방법론)

  • Kim, Dasom;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.22 no.4 / pp.193-215 / 2016
  • In recent years, the proliferation of diverse social networking services has led users to use many mediums simultaneously depending on their individual purpose and taste. Besides, while collecting information about particular themes, they usually employ various mediums such as social networking services, Internet news, and blogs. However, in terms of management, each document circulated through diverse mediums is placed in different categories on the basis of each source's policy and standards, hindering any attempt to conduct research on a specific category across different kinds of sources. For example, documents containing content on "Application for a foreign travel" can be classified into "Information Technology," "Travel," or "Life and Culture" according to the peculiar standard of each source. Likewise, with different viewpoints of definition and levels of specification for each source, similar categories can be named and structured differently in accordance with each source. To overcome these limitations, this study proposes a plan for conducting category mapping between different sources with various mediums while maintaining the existing category system of the medium as it is. Specifically, by re-classifying individual documents from the viewpoint of diverse sources and storing the result of such a classification as extra attributes, this study proposes a logical layer by which users can search for a specific document from multiple heterogeneous sources with different category names as if they belong to the same source. Besides, by collecting 6,000 articles of news from two Internet news portals, experiments were conducted to compare accuracy among sources, supervised learning and semi-supervised learning, and homogeneous and heterogeneous learning data. 
It is particularly interesting that in some categories, classifying accuracy of semi-supervised learning using heterogeneous learning data proved to be higher than that of supervised learning and semi-supervised learning, which used homogeneous learning data. This study has the following significances. First, it proposes a logical plan for establishing a system to integrate and manage all the heterogeneous mediums in different classifying systems while maintaining the existing physical classifying system as it is. This study's results particularly exhibit very different classifying accuracies in accordance with the heterogeneity of learning data; this is expected to spur further studies for enhancing the performance of the proposed methodology through the analysis of characteristics by category. In addition, with an increasing demand for search, collection, and analysis of documents from diverse mediums, the scope of the Internet search is not restricted to one medium. However, since each medium has a different categorical structure and name, it is actually very difficult to search for a specific category insofar as encompassing heterogeneous mediums. The proposed methodology is also significant for presenting a plan that enquires into all the documents regarding the standards of the relevant sites' categorical classification when the users select the desired site, while maintaining the existing site's characteristics and structure as it is. This study's proposed methodology needs to be further complemented in the following aspects. First, though only an indirect comparison and evaluation was made on the performance of this proposed methodology, future studies would need to conduct more direct tests on its accuracy. That is, after re-classifying documents of the object source on the basis of the categorical system of the existing source, the extent to which the classification was accurate needs to be verified through evaluation by actual users. 
In addition, the accuracy in classification needs to be increased by making the methodology more sophisticated. Furthermore, an understanding is required that the characteristics of some categories that showed a rather higher classifying accuracy of heterogeneous semi-supervised learning than that of supervised learning might assist in obtaining heterogeneous documents from diverse mediums and seeking plans that enhance the accuracy of document classification through its usage.

A Study on the Application of Outlier Analysis for Fraud Detection: Focused on Transactions of Auction Exception Agricultural Products (부정 탐지를 위한 이상치 분석 활용방안 연구 : 농수산 상장예외품목 거래를 대상으로)

  • Kim, Dongsung;Kim, Kitae;Kim, Jongwoo;Park, Steve
    • Journal of Intelligence and Information Systems / v.20 no.3 / pp.93-108 / 2014
  • Interest in and efforts toward analyzing and using transaction data from different perspectives to support business decision making are increasing. Such efforts are not limited to customer management or marketing; they also extend to monitoring and detecting fraudulent transactions. Fraudulent transactions are evolving into various patterns by taking advantage of information technology. To keep up with this evolution, much work has gone into fraud detection methods and advanced application systems that improve the accuracy and ease of fraud detection. As a case of fraud detection, this study aims to provide effective fraud detection methods for auction exception agricultural products in the largest Korean agricultural wholesale market. The auction exception products policy exists to complement auction-based trade in the agricultural wholesale market. That is, most trades in agricultural products are performed by auction; however, specific products are designated as auction exception products when their total volume is relatively small, the number of wholesalers is small, or wholesalers have difficulty purchasing the products. However, the auction exception products policy raises several problems of fairness and transparency in transactions, which calls for fraud detection. In this study, to generate fraud detection rules, real large-scale agricultural product transaction data from the market for 2008 to 2010 are analyzed, comprising more than 1 million transactions and 1 billion US dollars in transaction volume. Agricultural transaction data have unique characteristics such as frequent changes in supply volumes and turbulent time-dependent changes in price. Since this was the first attempt to identify fraudulent transactions in this domain, there was no training data set for supervised learning. Fraud detection rules are therefore generated using an outlier detection approach. 
We assume that outlier transactions are more likely to be fraudulent than normal transactions. Outlier transactions are identified by comparing unit prices against the daily, weekly, and quarterly average unit prices of product items. The quarterly average unit prices of product items for specific wholesalers are also used to identify outlier transactions. The reliability of the generated fraud detection rules was confirmed by domain experts. To determine whether a transaction is fraudulent, the normal distribution and the normalized Z-value concept are applied. That is, the unit price of a transaction is transformed into a Z-value to calculate its probability of occurrence when the distribution of unit prices is approximated by a normal distribution. A modified Z-value of the unit price is used rather than the original Z-value. The reason is that, for auction exception agricultural products, Z-values are influenced by the outlier fraudulent transactions themselves because the number of wholesalers is small. The modified Z-values are called Self-Eliminated Z-scores because they are calculated excluding the unit price of the specific transaction being checked for fraud. To show the usefulness of the proposed approach, a prototype fraud transaction detection system was developed using Delphi. The system consists of five main menus and related submenus. The first function of the system is to import transaction databases. The next is to set up fraud detection parameters; by changing these parameters, system users can control the number of potential fraudulent transactions. Execution functions provide the fraud detection results found under the chosen parameters. The potential fraudulent transactions can be viewed on screen or exported as files. The study is an initial attempt to identify fraudulent transactions in auction exception agricultural products. 
Many research topics on this issue remain. First, the scope of the analyzed data was limited by data availability. More data on transactions, wholesalers, and producers should be included to detect fraudulent transactions more accurately. Next, the scope of fraud detection needs to be extended to fishery products. There are also many possibilities for applying different data mining techniques to fraud detection. For example, a time series approach is a potential technique for this problem. Although outlier transactions are detected here based on the unit prices of transactions, it is also possible to derive fraud detection rules based on transaction volumes.
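The Self-Eliminated Z-score described above can be sketched in Python as follows. This is a minimal illustration of the stated idea (exclude the transaction under test when estimating the mean and spread), not the authors' Delphi implementation; the use of the sample standard deviation and the toy prices are assumptions.

```python
import math
import statistics


def ordinary_z_scores(prices):
    """Z-score of each price against the mean and stdev of all prices."""
    mean = statistics.mean(prices)
    sd = statistics.stdev(prices)  # sample standard deviation (assumed)
    return [(p - mean) / sd for p in prices]


def self_eliminated_z_scores(prices):
    """Z-score of each price against the mean and stdev of the OTHER prices.

    Leaving out the transaction under test keeps a single extreme price
    from inflating the spread it is judged against. Needs >= 3 prices.
    """
    scores = []
    for i, price in enumerate(prices):
        others = prices[:i] + prices[i + 1:]
        mean = statistics.mean(others)
        sd = statistics.stdev(others)
        if sd > 0:
            scores.append((price - mean) / sd)
        elif price == mean:
            scores.append(0.0)
        else:
            # All other prices are identical: any deviation is unbounded.
            scores.append(math.copysign(math.inf, price - mean))
    return scores


# Hypothetical unit prices for one product item; the last one is suspicious.
unit_prices = [10.0, 11.0, 9.0, 10.0, 30.0]
z_plain = ordinary_z_scores(unit_prices)
z_self = self_eliminated_z_scores(unit_prices)
print(round(z_plain[-1], 2), round(z_self[-1], 2))
```

With only five transactions, the suspicious price drags the overall mean and standard deviation toward itself, so its ordinary Z-score stays below 2, while its Self-Eliminated Z-score is above 20. This mirrors the abstract's point that small numbers of wholesalers let outliers mask themselves.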

The Empirical Study on the Effect of Technology Exchanges in the Fourth Industrial Revolution between Korea and China: Focused on the Firm Social Network Analysis (한중 4차산업혁명 기술교류 및 효과에 대한 실증연구: 기업 소셜 네트워크 분석 중심으로)

  • Zhou, Zhenxin;Sohn, Kwonsang;Hwang, Yoon Min;Kwon, Ohbyung
    • The Journal of Society for e-Business Studies / v.25 no.3 / pp.41-61 / 2020
  • China's rapid development and commercialization of high technologies in the fourth industrial revolution have made effective technology exchanges between Korean and Chinese firms more important to Korea's mid-term and long-term industrial development. However, there is still a lack of empirical research on how technology exchanges between Korean and Chinese firms proceed and how effective they are. In response, this study conducted a social network analysis of the current status and effects of Korea-China technology exchanges related to the fourth industrial revolution, based on text mining of news articles on Korea-China business technology exchange and cooperation published from 2018 to March 2020, and conducted a regression analysis of how network centrality affects firm performance. According to the results, most major Korean electronics firms are actively networking with Chinese firms and institutions and score high on the centrality indexes. Korean telecommunication firms showed high betweenness centrality and subgraph centrality, and Korean Internet service providers and broadcasting content firms showed high eigenvector centrality. In addition, Chinese firms showed higher betweenness centrality than Korean firms, and Chinese service firms showed higher closeness centrality than manufacturing firms. The regression analysis showed that network centrality had a positive effect on firm performance. To the best of our knowledge, this is the first study to analyze the impact of technical cooperation between Korean and Chinese firms in the context of the fourth industrial revolution. This study has theoretical implications in that it suggests a direction for empirical research on global firm cooperation based on social network analysis. It also has practical implications in that it offers guidelines for using network analysis when firms or governments set the direction of technical cooperation between Korea and China.
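Two of the simpler centrality measures used in this kind of firm network analysis can be sketched in plain Python. The example below computes normalized degree centrality and closeness centrality (the latter by breadth-first search) on a small undirected graph; the firm names and edges are hypothetical, and the betweenness, eigenvector, and subgraph centralities mentioned in the abstract are omitted for brevity.

```python
from collections import deque


def degree_centrality(graph):
    """Share of the other nodes that each node is directly linked to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def closeness_centrality(graph):
    """(reachable nodes) / (sum of shortest-path lengths), per node.

    Shortest paths are found by breadth-first search on an unweighted,
    undirected graph given as an adjacency dictionary.
    """
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbor in graph[current]:
                if neighbor not in dist:
                    dist[neighbor] = dist[current] + 1
                    queue.append(neighbor)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# Hypothetical cooperation network: one Korean firm tied to three Chinese firms.
network = {
    "KR_electronics": {"CN_firm_1", "CN_firm_2", "CN_firm_3"},
    "CN_firm_1": {"KR_electronics"},
    "CN_firm_2": {"KR_electronics"},
    "CN_firm_3": {"KR_electronics"},
}
degree = degree_centrality(network)
closeness = closeness_centrality(network)
print(degree["KR_electronics"], closeness["CN_firm_1"])
```

Centrality scores like these could then serve as explanatory variables in a regression on firm performance, which is the second step the abstract describes.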