• Title/Summary/Keyword: machine learning

Improvement of precipitation forecasting skill of ECMWF data using multi-layer perceptron technique (다층퍼셉트론 기법을 이용한 ECMWF 예측자료의 강수예측 정확도 향상)

  • Lee, Seungsoo; Kim, Gayoung; Yoon, Soonjo; An, Hyunuk
    • Journal of Korea Water Resources Association / v.52 no.7 / pp.475-482 / 2019
  • Subseasonal-to-seasonal (S2S) prediction information, with lead times of 2 weeks to 2 months, is expected to be used across many industries, but its utilization has fallen short of expectations because its predictability is lower than that of weather forecasts and mid- to long-term forecasts. In this study, we used a multi-layer perceptron (MLP), a machine learning technique, built for regression training to improve the predictability of S2S precipitation data for South Korea through post-processing. ECMWF hindcast information was used for MLP training, and the original data were compared with the trained outputs using dichotomous forecast verification. As a result, the bias score, accuracy, and Critical Success Index (CSI) of the trained output improved on average by 59.7%, 124.3%, and 88.5%, respectively. The probability of detection (POD) score decreased on average by 9.5%, which was attributed to the ECMWF model's excessive prediction of precipitation days. We confirmed that the predictability of ECMWF's S2S information can be improved by MLP post-processing even when the predictability of the original data is low. The results can be used to increase the usefulness of S2S information in the water resource and agricultural fields.
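
The dichotomous (yes/no) scores named in this abstract are conventionally computed from a 2x2 contingency table of forecast versus observed events. A minimal sketch, assuming the standard textbook definitions; the function name, the 1 mm/day rain threshold, and the sample values are illustrative assumptions and are not taken from the paper:

```python
def dichotomous_scores(forecast, observed, threshold=1.0):
    """Return bias, accuracy, CSI and POD for yes/no precipitation events.

    threshold (mm/day) is an assumed value used only for illustration.
    Raises ZeroDivisionError if no event was observed or forecast.
    """
    hits = misses = false_alarms = correct_negatives = 0
    for f, o in zip(forecast, observed):
        forecast_yes, observed_yes = f >= threshold, o >= threshold
        if forecast_yes and observed_yes:
            hits += 1
        elif forecast_yes:
            false_alarms += 1
        elif observed_yes:
            misses += 1
        else:
            correct_negatives += 1
    total = hits + misses + false_alarms + correct_negatives
    return {
        "bias": (hits + false_alarms) / (hits + misses),   # forecast yes / observed yes
        "accuracy": (hits + correct_negatives) / total,    # fraction correct
        "csi": hits / (hits + misses + false_alarms),      # critical success index
        "pod": hits / (hits + misses),                     # probability of detection
    }


# Hypothetical daily precipitation values (mm/day), not data from the study.
scores = dichotomous_scores(
    forecast=[5.0, 2.0, 0.0, 3.0, 0.0, 4.0],
    observed=[6.0, 0.0, 0.0, 2.0, 1.5, 0.0],
)
```

In this toy sample the bias score exceeds 1, meaning rain is forecast more often than it is observed. A forecast that predicts rain very often tends to score a high POD, so a post-processing step that removes excess rain days can lower POD while improving the other scores.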

Prediction Model of User Physical Activity using Data Characteristics-based Long Short-term Memory Recurrent Neural Networks

  • Kim, Joo-Chang; Chung, Kyungyong
    • KSII Transactions on Internet and Information Systems (TIIS) / v.13 no.4 / pp.2060-2077 / 2019
  • Recently, mobile healthcare services have attracted significant attention because of the emerging development and supply of diverse wearable devices. Smartwatches and health bands are the most common type of mobile-based wearable devices and their market size is increasing considerably. However, simple value comparisons based on accumulated data have revealed certain problems, such as the standardized nature of health management and the lack of personalized health management service models. The convergence of information technology (IT) and biotechnology (BT) has shifted the medical paradigm from continuous health management and disease prevention to the development of a system that can be used to provide ground-based medical services regardless of the user's location. Moreover, the IT-BT convergence has necessitated the development of lifestyle improvement models and services that utilize big data analysis and machine learning to provide mobile healthcare-based personal health management and disease prevention information. Users' health data, which are specific as they change over time, are collected by different means according to the users' lifestyle and surrounding circumstances. In this paper, we propose a prediction model of user physical activity that uses data characteristics-based long short-term memory (DC-LSTM) recurrent neural networks (RNNs). To provide personalized services, the characteristics and surrounding circumstances of data collectable from mobile host devices were considered in the selection of variables for the model. The data characteristics considered were ease of collection, which represents whether or not variables are collectable, and frequency of occurrence, which represents whether or not changes made to input values constitute significant variables in terms of activity. 
The variables selected for providing personalized services were activity, weather, temperature, mean daily temperature, humidity, UV, fine dust, asthma and lung disease probability index, skin disease probability index, cadence, travel distance, mean heart rate, and sleep hours. The selected variables were classified according to the data characteristics. To predict activity, an LSTM RNN was built that uses the classified variables as input data and learns the dynamic characteristics of time series data. LSTM RNNs resolve the vanishing gradient problem that occurs in existing RNNs. They are classified into three different types according to data characteristics and constructed through connections among the LSTMs. The constructed neural network learns training data and predicts user activity. To evaluate the proposed model, the root mean square error (RMSE) was used in the performance evaluation of the user physical activity prediction method for which an autoregressive integrated moving average (ARIMA) model, a convolutional neural network (CNN), and an RNN were used. The results show that the proposed DC-LSTM RNN method yields an excellent mean RMSE value of 0.616. The proposed method is used for predicting significant activity considering the surrounding circumstances and user status utilizing the existing standardized activity prediction services. It can also be used to predict user physical activity and provide personalized healthcare based on the data collectable from mobile host devices.
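
The root mean square error used for the comparison above can be sketched as follows. This is a generic definition only; the function name and the sample values are assumptions, and nothing here reproduces the paper's data or its reported RMSE:

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length numeric sequences."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical activity values, for illustration only.
error = rmse(predicted=[0.0, 0.0], actual=[3.0, 4.0])
```

Lower values indicate predictions closer to the observed activity, and the result is in the same units as the predicted variable.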

An Analysis on Determinants of the Capesize Freight Rate and Forecasting Models (케이프선 시장 운임의 결정요인 및 운임예측 모형 분석)

  • Lim, Sang-Seop; Yun, Hee-Sung
    • Journal of Navigation and Port Research / v.42 no.6 / pp.539-545 / 2018
  • In recent years, research on shipping market forecasting with the employment of non-linear AI models has attracted significant interest. In previous studies, input variables were selected with reference to past papers or by relying on the intuitions of the researchers. This paper attempts to address this issue by applying the stepwise regression model and the random forest model to the Cape-size bulk carrier market. The Cape market was selected due to the simplicity of its supply and demand structure. The preliminary selection of the determinants resulted in 16 variables. In the next stage, 8 features from the stepwise regression model and 10 features from the random forest model were screened as important determinants. The chosen variables were used to test both models. Based on the analysis of the models, it was observed that the random forest model outperforms the stepwise regression model. This research is significant because it provides a scientific basis which can be used to find the determinants in shipping market forecasting, and utilize a machine-learning model in the process. The results of this research can be used to enhance the decisions of chartering desks by offering a guideline for market analysis.

Investigating Opinion Mining Performance by Combining Feature Selection Methods with Word Embedding and BOW (Bag-of-Words) (속성선택방법과 워드임베딩 및 BOW (Bag-of-Words)를 결합한 오피니언 마이닝 성과에 관한 연구)

  • Eo, Kyun Sun; Lee, Kun Chang
    • Journal of Digital Convergence / v.17 no.2 / pp.163-170 / 2019
  • Over the past decade, the growth of the Web has explosively increased the volume of data. Feature selection is an important step in extracting valuable information from large amounts of data. This study proposes a novel opinion mining model that combines feature selection (FS) methods with Word2vec word embeddings and BOW (Bag-of-Words). The FS methods adopted are CFS (correlation-based FS) and IG (information gain). To select an optimal FS method, a number of classifiers were evaluated, ranging from LR (logistic regression), NN (neural network), and NBN (naive Bayesian network) to RF (random forest), RS (random subspace), and ST (stacking). Empirical results on the electronics and kitchen datasets showed that the LR and ST classifiers combined with IG applied to BOW features yielded the best opinion mining performance. Results on the laptop and restaurant datasets showed that the RF classifier using IG applied to Word2vec features performed best.
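
Information gain (IG), one of the two FS methods named above, scores a feature by how much knowing its value reduces the entropy of the class labels. A minimal sketch for discrete features, assuming the usual entropy-based definition; the function names and the toy review data are illustrative assumptions, not the paper's implementation:

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy example: does the (hypothetical) word "great" appear in each review?
sentiments = ["pos", "pos", "neg", "neg"]
contains_great = [1, 1, 0, 0]   # perfectly separates the two classes
contains_the = [1, 0, 1, 0]     # carries no class information
```

Features are then ranked by their gain and the top-scoring ones are kept as classifier inputs. Continuous features such as embedding dimensions would need to be discretized first, which this sketch does not cover.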

Landslide Susceptibility Prediction using Evidential Belief Function, Weight of Evidence and Artificial Neural Network Models (Evidential Belief Function, Weight of Evidence 및 Artificial Neural Network 모델을 이용한 산사태 공간 취약성 예측 연구)

  • Lee, Saro; Oh, Hyun-Joo
    • Korean Journal of Remote Sensing / v.35 no.2 / pp.299-316 / 2019
  • The purpose of this study was to analyze landslide susceptibility in the Pyeongchang area using Weight of Evidence (WOE) and Evidential Belief Function (EBF) as probability models and Artificial Neural Networks (ANN) as a machine learning model in a geographic information system (GIS). This study examined the widespread shallow landslides triggered by heavy rainfall during Typhoon Ewiniar in 2006, which caused serious property damage and significant loss of life. For the landslide susceptibility mapping, 3,955 landslide occurrences were detected using aerial photographs, and environmental spatial data such as terrain, geology, soil, forest, and land use were collected and constructed in a spatial database. Seventeen factors that could affect landsliding were extracted from the spatial database. All landslides were randomly separated into two datasets, a training set (50%) and validation set (50%), to establish and validate the EBF, WOE, and ANN models. According to the validation results of the area under the curve (AUC) method, the accuracy was 74.73%, 75.03%, and 70.87% for WOE, EBF, and ANN, respectively. The EBF model had the highest accuracy. However, all models had predictive accuracy exceeding 70%, the level that is effective for landslide susceptibility mapping. These models can be applied to predict landslide susceptibility in an area where landslides have not occurred previously based on the relationships between landslide and environmental factors. This susceptibility map can help reduce landslide risk, provide guidance for policy and land use development, and save time and expense for landslide hazard prevention. In the future, more generalized models should be developed by applying landslide susceptibility mapping in various areas.
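
The random 50/50 separation of landslide occurrences described above can be sketched as below. The function name, the fixed seed, and the use of integer IDs in place of real landslide records are assumptions made for illustration; the abstract does not state how the split was implemented:

```python
import random


def split_half(items, seed=0):
    """Shuffle a copy of items and split it into two halves.

    The seed is an assumed value that only makes the sketch reproducible.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


# Integer IDs stand in for the 3,955 mapped landslide occurrences.
training, validation = split_half(range(3955))
```

Because 3,955 is odd, the two sets differ in size by one record (1,977 and 1,978 here), so the split is only approximately 50/50.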

Causal inference from nonrandomized data: key concepts and recent trends (비실험 자료로부터의 인과 추론: 핵심 개념과 최근 동향)

  • Choi, Young-Geun; Yu, Donghyeon
    • The Korean Journal of Applied Statistics / v.32 no.2 / pp.173-185 / 2019
  • Causal questions are prevalent in scientific research: for example, how effective a treatment was in preventing an infectious disease, how much a policy increased utility, or which advertisement would yield the highest click rate for a given customer. Causal inference theory in statistics interprets such questions as inferring the effect of a given intervention (treatment or policy) in the data-generating process. Causal inference has been used in medicine, public health, and economics; in addition, it has recently received attention as a tool for data-driven decision making. Many recent datasets are observational rather than experimental, which makes causal inference more complex. This review introduces key concepts and recent trends in statistical causal inference for observational studies. We first introduce the Neyman-Rubin potential outcome framework, which formalizes causal questions as average treatment effects, and discuss popular methods for estimating treatment effects, such as propensity score and regression approaches. For recent trends, we briefly discuss (1) conditional (heterogeneous) treatment effects and machine learning-based approaches, (2) the curse of dimensionality in treatment effect estimation and its remedies, and (3) Pearl's structural causal model for handling more complex causal relationships and its connection to the Neyman-Rubin potential outcome model.
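
For orientation, the quantities mentioned in this abstract are commonly written as follows. The notation is an assumption (it is not quoted from the review): $Y(1)$ and $Y(0)$ are the potential outcomes under treatment and control, $T$ is the treatment indicator, and $X$ the observed covariates.

```latex
% Average and conditional (heterogeneous) treatment effects
\tau_{\mathrm{ATE}} = \mathbb{E}\left[ Y(1) - Y(0) \right],
\qquad
\tau(x) = \mathbb{E}\left[ Y(1) - Y(0) \mid X = x \right]

% Propensity score and the inverse-probability-weighted estimator of the ATE
e(x) = \Pr(T = 1 \mid X = x),
\qquad
\hat{\tau}_{\mathrm{IPW}} = \frac{1}{n} \sum_{i=1}^{n}
  \left( \frac{T_i Y_i}{e(X_i)} - \frac{(1 - T_i)\, Y_i}{1 - e(X_i)} \right)
```

The weighted estimator is one example of a propensity score approach. It relies on the usual assumptions of unconfoundedness given $X$ and overlap, $0 < e(x) < 1$, and in practice $e(x)$ is replaced by an estimate.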

Application of Google Search Queries for Predicting the Unemployment Rate for Koreans in Their 30s and 40s (한국 30~40대 실업률 예측을 위한 구글 검색 정보의 활용)

  • Jung, Jae Un; Hwang, Jinho
    • Journal of Digital Convergence / v.17 no.9 / pp.135-145 / 2019
  • A prolonged recession has kept the youth unemployment rate in Korea at a high level of approximately 10% for years. Recently, the number of unemployed Koreans in their 30s and 40s has shown an upward trend. To expand the government's employment promotion and unemployment benefits from youth-centered policies to diverse age groups, including people in their 30s and 40s, prediction models for different age groups are required. Thus, we aimed to develop unemployment prediction models for specific age groups (30s and 40s) using unemployment rates provided by Statistics Korea and related Google search queries. We first estimated multiple linear regressions (Model 1) using a seasonal autoregressive integrated moving average approach with the relevant unemployment rates. Then, we introduced Google search queries to obtain improved models (Model 2). For both groups, Model 2, which additionally used web queries, outperformed Model 1 during the training and prediction periods. This result indicates that web search queries remain significant for improving unemployment prediction models for Koreans. This study needs further work before practical application, but it will contribute to obtaining age-specific unemployment predictions.

Study on Anomaly Detection Method of Improper Foods using Import Food Big data (수입식품 빅데이터를 이용한 부적합식품 탐지 시스템에 관한 연구)

  • Cho, Sanggoo; Choi, Gyunghyun
    • The Journal of Bigdata / v.3 no.2 / pp.19-33 / 2018
  • Owing to the growth of FTAs and food trade and the diversifying preferences of consumers, food imports have increased at a tremendous rate every year. While inspections cover about 20% of total food imports, the budget and manpower needed for the government's import inspection control are reaching their limits. Sudden imported-food accidents can cause enormous social and economic losses. Therefore, a predictive system that forecasts the compliance of food imports, together with preemptive measures, would greatly improve the efficiency and effectiveness of import safety control management. A huge amount of data has already accumulated from the past. Processed foods account for 75% of total food imports in the import food sector. Big data analysis and analytical techniques are also used to extract meaningful information from large amounts of data. Unfortunately, few studies have analyzed imported food and its implications using food import big data. In this context, this study applied a variety of machine learning classification algorithms and suggested a data preprocessing method that generates new derived variables to improve model accuracy. In addition, the study compared the performance of the predictive classification algorithms with the general base classifier. Among the base classifiers, the Gaussian Naïve Bayes prediction model showed the best performance in detecting and predicting the nonconformity of imported food. In the future, the anomaly detection model using Gaussian Naïve Bayes is expected to be applied. The predictive model will reduce the burden of inspecting imported food and increase the rate at which non-conforming products are detected, which will greatly benefit the efficiency of food import safety control and the speed of import customs clearance.

Construction of a Bark Dataset for Automatic Tree Identification and Developing a Convolutional Neural Network-based Tree Species Identification Model (수목 동정을 위한 수피 분류 데이터셋 구축과 합성곱 신경망 기반 53개 수종의 동정 모델 개발)

  • Kim, Tae Kyung; Baek, Gyu Heon; Kim, Hyun Seok
    • Journal of Korean Society of Forest Science / v.110 no.2 / pp.155-164 / 2021
  • Many studies have developed automatic plant identification algorithms by applying machine learning to various plant features, such as leaves and flowers. Unlike other plant characteristics, bark changes little across seasons and persists for a long period. Nevertheless, bark has a complex shape that varies widely with the environment, and there are insufficient materials available for training algorithms. Here, in addition to the previously published bark image dataset, BarkNet v.1.0, bark images were collected, and a dataset consisting of 53 tree species that can be easily observed in Korea was presented. A convolutional neural network (CNN) was trained and tested on the dataset, and the factors that interfere with the model's performance were identified. For the CNN architecture, VGG-16 and VGG-19 were used. VGG-16 achieved 90.41% accuracy and VGG-19 achieved 92.62% accuracy. When the model was tested on new tree images of species that are absent from the original dataset but belong to the same genus or family, more than 80% of cases were successfully identified as the same genus or family. Meanwhile, the model tended to misclassify images containing distracting features such as leaves, mosses, and knots. For these cases, we propose that random cropping and classification by majority vote can reduce possible errors in training and inference.
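
The random cropping and majority-vote idea proposed at the end of this abstract can be sketched as follows. The helper names, the 224-pixel crop size, the image dimensions, and the hard-coded per-crop predictions are all assumptions for illustration; the per-crop classifier itself (the trained CNN) is not shown:

```python
import random
from collections import Counter


def random_crop_boxes(width, height, crop_size, n_crops, rng):
    """Return n_crops random square boxes (left, top, right, bottom) inside the image."""
    boxes = []
    for _ in range(n_crops):
        left = rng.randint(0, width - crop_size)
        top = rng.randint(0, height - crop_size)
        boxes.append((left, top, left + crop_size, top + crop_size))
    return boxes


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


# Crop size and image dimensions are assumed values.
boxes = random_crop_boxes(width=1024, height=768, crop_size=224, n_crops=5,
                          rng=random.Random(0))

# Hypothetical per-crop predictions standing in for the CNN's outputs.
crop_predictions = ["Pinus", "Quercus", "Pinus", "Pinus", "Quercus"]
species = majority_vote(crop_predictions)
```

The intent is that a crop dominated by moss or a knot produces at most one wrong vote, which the remaining crops can outvote.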

Spatial Conservation Prioritization Considering Development Impacts and Habitat Suitability of Endangered Species (개발영향과 멸종위기종의 서식적합성을 고려한 보전 우선순위 선정)

  • Mo, Yongwon
    • Korean Journal of Environment and Ecology / v.35 no.2 / pp.193-203 / 2021
  • As the number of endangered species gradually increases due to land development by humans, it is essential to proactively secure sufficient protected areas (PAs). Therefore, this study identified priority conservation areas for selecting candidate PAs while considering the impact of land development. We determined the conservation priorities by analyzing four scenarios based on existing conservation areas and the development impact, using MARXAN, a decision-support software for conservation planning. The development impact was derived from the developed area ratio, population density, road network system, and traffic volume. The conservation areas of endangered species were derived from the occurrence points of birds, mammals, and herptiles in the 3rd National Ecosystem Survey. These two factors were used as input data to map conservation priority areas with the machine learning-based optimization methodology. The results identified many non-PA areas that were expected to play an important role in conserving endangered species. When the land development impact was considered, the priority conservation areas were found to be fragmented. Even when both the development impact and existing PAs were considered, the priority was higher in areas from the current PAs because many road developments had already been completed around the current PAs. Therefore, it is necessary to consider areas other than the current PAs to protect endangered species and to seek alternative measures for the fragmented conservation priority areas.