• Title/Summary/Keyword: Gradient Boosting Regression

Analysis of cycle racing ranking using statistical prediction models (통계적 예측모형을 활용한 경륜 경기 순위 분석)

  • Park, Gahee;Park, Rira;Song, Jongwoo
    • The Korean Journal of Applied Statistics / v.30 no.1 / pp.25-39 / 2017
  • Over 5 million people participate in cycle racing betting and its revenue is more than 2 trillion won. This study predicts the ranking of cycle racing using various statistical analyses and identifies important variables which have influence on ranking. We propose competitive ranking prediction models using various classification and regression methods. Our model can predict rankings with low misclassification rates most of the time. We found that the ranking increases as the grade of a racer decreases and as overall scores increase. Inversely, we can observe that the ranking decreases when the grade of a racer increases, race number four is given, and the ranking of the last race of a racer decreases. We also found that prediction accuracy can be improved when we use centered data per race instead of raw data. However, the real profit from the future data was not high when we applied our prediction model because our model can predict only low-return events well.
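
The per-race centering mentioned in the abstract can be illustrated with a minimal sketch. It assumes each racer is a record carrying a race identifier and one numeric feature; the field names `race_id` and `score` and the toy values are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def center_per_race(rows, race_key, feature):
    """Subtract the within-race mean of `feature` from every row's value."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[race_key]] += row[feature]
        counts[row[race_key]] += 1
    means = {race: totals[race] / counts[race] for race in totals}
    # Return copies of the rows with an extra "<feature>_centered" field.
    return [
        {**row, feature + "_centered": row[feature] - means[row[race_key]]}
        for row in rows
    ]


# Hypothetical toy data: two races with two racers each.
racers = [
    {"race_id": 1, "score": 90.0},
    {"race_id": 1, "score": 80.0},
    {"race_id": 2, "score": 60.0},
    {"race_id": 2, "score": 70.0},
]
centered = center_per_race(racers, "race_id", "score")
```

After centering, a racer's value expresses strength relative to the other entrants in the same race rather than on an absolute scale, which is plausibly why it helps when the target is a within-race rank.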

The Analysis of Private Education Cost for the Elementary, Middle, and High School Students in Korea (초,중,고 사교육비 영향요인 분석)

  • Lee, Hyejeong;Song, Jongwoo
    • The Korean Journal of Applied Statistics / v.27 no.7 / pp.1125-1137 / 2014
  • This paper studies what affects private education costs for elementary, middle, and high school students. This is a pressing issue because equal opportunity in education can be undermined if private education accounts for a very high share of total education costs. If more time and money are spent on private education than on school education, it can cause polarization among classes and regions. Excessive private education can also weaken the school system. We use various regression and classification methods to analyze the cost of private education and find the important variables in the models. We found that large cities spend more money on private education than small cities. We also found that high school students spend more than middle school and elementary students, and that households with more income spend more money on private education.

Forecasting of the COVID-19 pandemic situation of Korea

  • Goo, Taewan;Apio, Catherine;Heo, Gyujin;Lee, Doeun;Lee, Jong Hyeok;Lim, Jisun;Han, Kyulhee;Park, Taesung
    • Genomics & Informatics / v.19 no.1 / pp.11.1-11.8 / 2021
  • For the novel coronavirus disease 2019 (COVID-19), predictive modeling in the literature broadly uses susceptible-exposed-infected-recovered (SEIR)/SIR, agent-based, and curve-fitting models. Governments and legislative bodies rely on insights from prediction models to suggest new policies and to assess the effectiveness of enforced policies. Therefore, access to accurate outbreak prediction models is essential to obtain insights into the likely spread and consequences of infectious diseases. The objective of this study is to predict the future COVID-19 situation of Korea. Here, we employed 5 models for this analysis: SEIR, local linear regression (LLR), negative binomial (NB) regression, segmented Poisson, deep-learning based long short-term memory models (LSTM), and tree-based gradient boosting machine (GBM). After prediction, model performance was compared using relative mean squared errors (RMSE) for two sets of training data (January 20, 2020-December 31, 2020 and January 20, 2020-January 31, 2021) and testing data (January 1, 2021-February 28, 2021 and February 1, 2021-February 28, 2021). Except for the segmented Poisson model, the models predicted a decline in the daily confirmed cases in the country for the near future. Comparison of RMSE values showed that LLR, GBM, SEIR, NB, and LSTM, respectively, performed well in forecasting the pandemic situation of the country. A good understanding of the epidemic dynamics would greatly enhance the control and prevention of COVID-19 and other infectious diseases. Therefore, with daily confirmed cases increasing since the start of this year, these results could help in the pandemic response by informing decisions about planning, resource allocation, and social distancing policies.

A Study on the Employee Turnover Prediction using XGBoost and SHAP (XGBoost와 SHAP 기법을 활용한 근로자 이직 예측에 관한 연구)

  • Lee, Jae Jun;Lee, Yu Rin;Lim, Do Hyun;Ahn, Hyun Chul
    • The Journal of Information Systems / v.30 no.4 / pp.21-42 / 2021
  • Purpose: In order for companies to continue to grow, they should properly manage human resources, which are the core of corporate competitiveness. Employee turnover means the loss of talent in the workforce. When an employee voluntarily leaves, the company loses the hiring and training costs it has invested, key personnel are withdrawn, and new costs are incurred to train a replacement. From an employee's viewpoint, moving to another company is also risky because it can be time consuming and costly. Therefore, in order to reduce the social and economic costs caused by employee turnover, it is necessary to accurately predict employee turnover intention, identify the factors affecting employee turnover, and manage them appropriately in the company. Design/methodology/approach: Prior studies have mainly used logistic regression and decision trees, which have explanatory power but poor predictive accuracy. In order to develop a more accurate prediction model, XGBoost is proposed as the classification technique. Then, to compensate for the lack of explainability, SHAP, one of the XAI techniques, is applied. As a result, the prediction accuracy of the proposed model is improved compared to conventional methods such as LOGIT and decision trees. By applying SHAP to the proposed model, the factors affecting overall employee turnover intention as well as a specific sample's turnover intention are identified. Findings: Experimental results show that the prediction accuracy of XGBoost is superior to that of logistic regression and decision trees. Using SHAP, we find that jobseeking, annuity, eng_test, comm_temp, seti_dev, seti_money, equl_ablt, and sati_safe significantly affect overall employee turnover intention. In addition, it is confirmed that the factors affecting an individual's turnover intention are more diverse. Our research findings imply that companies should adopt a personalized approach for each employee in order to effectively prevent his or her turnover.

A Target Selection Model for the Counseling Services in Long-Term Care Insurance (노인장기요양보험 이용지원 상담 대상자 선정모형 개발)

  • Han, Eun-Jeong;Kim, Dong-Geon
    • The Korean Journal of Applied Statistics / v.28 no.6 / pp.1063-1073 / 2015
  • In the long-term care insurance (LTCI) system, the National Health Insurance Service (NHIS) provides counseling services for beneficiaries and their family caregivers, which help them use LTC services appropriately. The purpose of this study was to develop a Target Selection Model for the Counseling Services based on the needs of beneficiaries and their family caregivers. To develop the models, we used a data set of 2,000 beneficiaries and family caregivers in total who used long-term care services in their homes in March 2013 and completed questionnaires. The Target Selection Model was established through various data-mining models such as logistic regression, gradient boosting, Lasso, decision tree, ensemble, and neural network models. The Lasso model was selected as the final model because of its stability, high performance, and availability. Our results might improve the satisfaction with and efficiency of the NHIS counseling services.

Asian Ethnic Group Classification Model Using Data Mining (데이터마이닝 방법을 이용한 아시아 민족 분류 모형 구축)

  • Kim, Yoon Geon;Lee, Ji Hyun;Cho, Sohee;Kim, Moon Young;Lee, Soong Deok;Ha, Eun Ho;Ahn, Jae Joon
    • The Korean Journal of Legal Medicine / v.41 no.2 / pp.32-40 / 2017
  • In addition to identifying genetic differences between target populations, it is also important to determine the impact of genetic differences with regard to the respective target populations. In recent years, there has been an increasing number of cases where this approach is needed, and thus various statistical methods must be considered. In this study, genetic data from populations of Southeast and Southwest Asia were collected, and several statistical approaches were evaluated on the Y-chromosome short tandem repeat data. In order to develop a more accurate and practical classification model, we applied gradient boosting and ensemble techniques. In distinguishing between the Southeast and Southwest Asian populations, the overall performance of these classification models was better than that of the decision trees and regression models used in the past. In conclusion, this study suggests that additional statistical approaches, such as data mining techniques, could provide more useful interpretations for forensic analyses. These trials are expected to be the basis for further studies extending from target regions to the entire continent of Asia as well as the use of additional genes such as mitochondrial genes.

Feature selection and prediction modeling of drug responsiveness in Pharmacogenomics (약물유전체학에서 약물반응 예측모형과 변수선택 방법)

  • Kim, Kyuhwan;Kim, Wonkuk
    • The Korean Journal of Applied Statistics / v.34 no.2 / pp.153-166 / 2021
  • A main goal of pharmacogenomics studies is to predict an individual's drug responsiveness based on high-dimensional genetic variables. Because the number of variables is large, feature selection is required to reduce it. The selected features are used to construct a predictive model using machine learning algorithms. In the present study, we applied several hybrid feature selection methods, such as combinations of logistic regression, ReliefF, TurF, random forest, and LASSO, to a next generation sequencing data set of 400 epilepsy patients. We then applied the selected features to machine learning methods including random forest, gradient boosting, and support vector machine as well as a stacking ensemble method. Our results showed that the stacking model with a hybrid feature selection of random forest and ReliefF performs better than other combinations of approaches. Based on a 5-fold cross validation partition, the mean test accuracy of the best model was 0.727 and the mean test AUC of the best model was 0.761. It also appeared that the stacking models outperform single machine learning predictive models when using the same selected features.
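
The 5-fold cross validation used to report mean test accuracy and AUC can be sketched as below. This is a generic illustration, not the authors' procedure: the helper names `k_fold_indices` and `mean_cv_score` are hypothetical, the folds are contiguous and unshuffled, and the scoring callback stands in for whatever model fitting and metric one would plug in.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs that partition range(n_samples) into k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def mean_cv_score(n_samples, fit_and_score, k=5):
    """Average a user-supplied score over the k held-out folds.

    `fit_and_score(train_idx, test_idx)` is assumed to fit a model on the
    training indices and return one number (e.g. accuracy) on the test indices.
    """
    scores = [fit_and_score(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# With 400 samples, as in the abstract, each of the 5 test folds holds 80 samples.
folds = list(k_fold_indices(400, k=5))
```

In practice the sample order would usually be shuffled (and often stratified by class) before splitting; that step is omitted here to keep the sketch short.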

Cross-Technology Localization: Leveraging Commodity WiFi to Localize Non-WiFi Device

  • Zhang, Dian;Zhang, Rujun;Guo, Haizhou;Xiang, Peng;Guo, Xiaonan
    • KSII Transactions on Internet and Information Systems (TIIS) / v.15 no.11 / pp.3950-3969 / 2021
  • Radio Frequency (RF)-based indoor localization technologies play significant roles in various Internet of Things (IoT) services (e.g., location-based services). Most such technologies require that all the devices comply with a specified technology (e.g., WiFi, ZigBee, or Bluetooth). However, this requirement limits their application scenarios in today's IoT context, where multiple devices complying with different standards coexist in a shared environment. To bridge the gap, in this paper, we propose a cross-technology localization approach, which is able to localize target nodes using a different type of device. Specifically, the proposed framework reuses the existing WiFi infrastructure, without introducing additional cost, to localize non-WiFi devices (i.e., ZigBee). The key idea is to leverage the interference between devices that share the same operating frequency (e.g., 2.4GHz). Such interference exhibits unique patterns that depend on the target device's location, so it can be leveraged for cross-technology localization. The proposed framework uses Principal Components Analysis (PCA) to extract salient features of the received WiFi signals, and leverages Dynamic Time Warping (DTW) and Gradient Boosting Regression Tree (GBRT) to improve the robustness of our system. We conduct experiments in a real scenario and investigate the impact of different factors. Experimental results show that the average localization accuracy of our prototype can reach 1.54m, which demonstrates a promising direction for building cross-technology localization systems to fulfill the needs of the modern IoT context.
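
As background for the Dynamic Time Warping step named in the abstract, here is a minimal sketch of the textbook DTW distance between two one-dimensional sequences. The abstract does not say which local distance the authors use or how DTW feeds into GBRT, so the absolute-difference cost and the function name `dtw_distance` are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW with |x - y| as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b[j-1] is reused
                cost[i][j - 1],      # b advances, a[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# A time-stretched copy of a signal aligns at zero cost.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

Because DTW tolerates local stretching and compression along the time axis, it is a common choice for comparing signal traces whose timing varies between measurements.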

Machine Learning Model for Recommending Products and Estimating Sales Prices of Reverse Direct Purchase (역직구 상품 추천 및 판매가 추정을 위한 머신러닝 모델)

  • Kyu Ik Kim;Berdibayev Yergali;Soo Hyung Kim;Jin Suk Kim
    • Journal of Korean Society of Industrial and Systems Engineering / v.46 no.2 / pp.176-182 / 2023
  • With about 80% of the global economy expected to shift to the global market by 2030, exports of reverse direct purchase products, in which foreign consumers purchase products from online shopping malls in Korea, are growing 55% annually. As of 2021, sales of reverse direct purchases in South Korea increased 50.6% from the previous year, surpassing 40 million. For domestic SMEs (small and medium-sized enterprises) to enter overseas markets, it is important to devise export strategies based on various market analysis information. For small and medium-sized sellers, however, entry barriers are high: they lack information on overseas markets and have difficulty selecting locally preferred products and determining competitive sales prices. This study develops an AI-based product recommendation and sales price estimation model that collects and analyzes global shopping mall and product trends, providing marketing information on promising products and appropriate sales prices to small and medium-sized sellers who have difficulty collecting global market information. The product recommendation model is based on the LTR (Learning To Rank) methodology. In a performance comparison using nDCG, the pair-wise XGBoost-LambdaMART model showed excellent performance. The sales price estimation model uses a regression algorithm. According to the R-squared value, the Light Gradient Boosting Machine performs best in this model.
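
The nDCG metric used to compare the ranking models can be sketched as follows. The abstract does not state which nDCG variant or cutoff was used; this sketch assumes linear gain with a log2 position discount, and the function names `dcg` and `ndcg` are illustrative.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain of relevance grades listed in ranked order."""
    ranked = relevances[:k] if k is not None else relevances
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked))


def ndcg(relevances, k=None):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades in the order a model ranked the items.
perfect = ndcg([3, 2, 1])
swapped = ndcg([0, 1])
```

A score of 1.0 means the model's ordering matches the ideal ordering of relevance grades; placing the only relevant item second instead of first lowers the score to 1/log2(3), about 0.63.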

Monitoring Ground-level SO2 Concentrations Based on a Stacking Ensemble Approach Using Satellite Data and Numerical Models (위성 자료와 수치모델 자료를 활용한 스태킹 앙상블 기반 SO2 지상농도 추정)

  • Choi, Hyunyoung;Kang, Yoojin;Im, Jungho;Shin, Minso;Park, Seohui;Kim, Sang-Min
    • Korean Journal of Remote Sensing / v.36 no.5_3 / pp.1053-1066 / 2020
  • Sulfur dioxide (SO2) is primarily released through industrial, residential, and transportation activities, and creates secondary air pollutants through chemical reactions in the atmosphere. Long-term exposure to SO2 can harm the human body, causing respiratory or cardiovascular disease, which makes effective and continuous monitoring of SO2 crucial. In South Korea, SO2 is monitored at ground stations, but this does not provide spatially continuous information on SO2 concentrations. Thus, this research estimated spatially continuous ground-level SO2 concentrations at 1 km resolution over South Korea through the synergistic use of satellite data and numerical models. A stacking ensemble approach, fusing multiple machine learning algorithms at two levels (i.e., base and meta), was adopted for ground-level SO2 estimation using data from January 2015 to April 2019. Random forest and extreme gradient boosting were used as base models and multiple linear regression was adopted for the meta-model. The cross-validation results showed that the meta-model improved performance by 25% compared to the base models, resulting in a correlation coefficient of 0.48 and a root-mean-square error of 0.0032 ppm. In addition, the temporal transferability of the approach was evaluated on one year of data not used in model development. The spatial distribution of ground-level SO2 concentrations based on the proposed model agreed with the general seasonality of SO2 and the temporal patterns of emission sources.
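
The two-level stacking design described in the abstract (tree-based base models, linear regression meta-model) can be sketched with scikit-learn. This is an illustration under stated assumptions, not the authors' pipeline: scikit-learn's `GradientBoostingRegressor` is swapped in for extreme gradient boosting, the data are synthetic, and all hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 4 predictors and a nonlinear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Base level: two tree ensembles. Meta level: linear regression fitted on
# the base models' out-of-fold predictions (cv=5).
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=LinearRegression(),
    cv=5,
)
stack.fit(X_train, y_train)

predictions = stack.predict(X_test)
rmse = float(np.sqrt(np.mean((predictions - y_test) ** 2)))
```

Training the meta-model on out-of-fold predictions rather than in-sample predictions keeps it from simply rewarding whichever base model overfits the training data most.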