• Title/Summary/Keyword: Ensemble Boosting Model

Search Result 71, Processing Time 0.025 seconds

Optimal Sensor Location in Water Distribution Network using XGBoost Model (XGBoost 기반 상수도관망 센서 위치 최적화)

  • Hyewoon Jang;Donghwi Jung
    • Proceedings of the Korea Water Resources Association Conference
    • /
    • 2023.05a
    • /
    • pp.217-217
    • /
    • 2023
  • 상수도관망은 사용자에게 고품질의 물을 안정적으로 공급하는 것을 목적으로 하며, 이를 평가하기 위한 지표 중 하나로 압력을 활용한다. 최근 스마트 센서의 설치가 확장됨에 따라 기계학습기법을 이용한 실시간 데이터 기반의 분석이 활발하다. 따라서 어디에서 데이터를 수집하느냐에 대한 센서 위치 결정이 중요하다. 본 연구는 eXtreme Gradient Boosting(XGBoost) 모델을 활용하여 대규모 상수도관망 내 센서 위치를 최적화하는 방법론을 제안한다. XGBoost 모델은 여러 의사결정 나무(decision tree)를 활용하는 앙상블(ensemble) 모델이며, 오차에 따른 가중치를 부여하여 성능을 향상시키는 부스팅(boosting) 방식을 이용한다. 이는 분산 및 병렬 처리가 가능해 메모리리소스를 최적으로 사용하고, 학습 속도가 빠르며 결측치에 대한 전처리 과정을 모델 내에 포함하고 있다는 장점이 있다. 모델 구현을 위한 독립 변수 결정을 위해 압력 데이터의 변동성 및 평균압력 값을 고려하여 상수도관망을 대표하는 중요 절점(critical node)를 선정한다. 중요 절점의 압력 값을 예측하는 XGBoost 모델을 구축하고 모델의 성능과 요인 중요도(feature importance) 값을 고려하여 센서의 최적 위치를 선정한다. 이러한 방법론을 기반으로 상수도관망의 특성에 따른 경향성을 파악하기 위해 다양한 형태(예를 들어, 망형, 가지형)와 구성 절점의 수를 변화시키며 결과를 분석한다. 본 연구에서 구축한 XGBoost 모델은 추가적인 전처리 과정을 최소화하며 대규모 관망에 간편하게 사용할 수 있어 추후 다양한 입출력 데이터의 조합을 통해 센서 위치 외에도 상수도관망에서의 성능 최적화에 활용할 수 있을 것으로 기대한다.

  • PDF

Development of Prediction Model of Chloride Diffusion Coefficient using Machine Learning (기계학습을 이용한 염화물 확산계수 예측모델 개발)

  • Kim, Hyun-Su
    • Journal of Korean Association for Spatial Structures
    • /
    • v.23 no.3
    • /
    • pp.87-94
    • /
    • 2023
  • Chloride is one of the most common threats to reinforced concrete (RC) durability. Alkaline environment of concrete makes a passive layer on the surface of reinforcement bars that prevents the bar from corrosion. However, when the chloride concentration amount at the reinforcement bar reaches a certain level, deterioration of the passive protection layer occurs, causing corrosion and ultimately reducing the structure's safety and durability. Therefore, understanding the chloride diffusion and its prediction are important to evaluate the safety and durability of RC structure. In this study, the chloride diffusion coefficient is predicted by machine learning techniques. Various machine learning techniques such as multiple linear regression, decision tree, random forest, support vector machine, artificial neural networks, extreme gradient boosting annd k-nearest neighbor were used and accuracy of there models were compared. In order to evaluate the accuracy, root mean square error (RMSE), mean square error (MSE), mean absolute error (MAE) and coefficient of determination (R2) were used as prediction performance indices. The k-fold cross-validation procedure was used to estimate the performance of machine learning models when making predictions on data not used during training. Grid search was applied to hyperparameter optimization. It has been shown from numerical simulation that ensemble learning methods such as random forest and extreme gradient boosting successfully predicted the chloride diffusion coefficient and artificial neural networks also provided accurate result.

Ensemble deep learning-based models to predict the resilient modulus of modified base materials subjected to wet-dry cycles

  • Mahzad Esmaeili-Falak;Reza Sarkhani Benemaran
    • Geomechanics and Engineering
    • /
    • v.32 no.6
    • /
    • pp.583-600
    • /
    • 2023
  • The resilient modulus (MR) of various pavement materials plays a significant role in the pavement design by a mechanistic-empirical method. The MR determination is done by experimental tests that need time and money, along with special experimental tools. The present paper suggested a novel hybridized extreme gradient boosting (XGB) structure for forecasting the MR of modified base materials subject to wet-dry cycles. The models were created by various combinations of input variables called deep learning. Input variables consist of the number of W-D cycles (WDC), the ratio of free lime to SAF (CSAFR), the ratio of maximum dry density to the optimum moisture content (DMR), confining pressure (σ3), and deviatoric stress (σd). Two XGB structures were produced for the estimation aims, where determinative variables were optimized by particle swarm optimization (PSO) and black widow optimization algorithm (BWOA). According to the results' description and outputs of Taylor diagram, M1 model with the combination of WDC, CSAFR, DMR, σ3, and σd is recognized as the most suitable model, with R2 and RMSE values of BWOA-XGB for model M1 equal to 0.9991 and 55.19 MPa, respectively. Interestingly, the lowest value of RMSE for literature was at 116.94 MPa, while this study could gain the extremely lower RMSE owned by BWOA-XGB model at 55.198 MPa. At last, the explanations indicate the BWO algorithm's capability in determining the optimal value of XGB determinative parameters in MR prediction procedure.

Feature selection and prediction modeling of drug responsiveness in Pharmacogenomics (약물유전체학에서 약물반응 예측모형과 변수선택 방법)

  • Kim, Kyuhwan;Kim, Wonkuk
    • The Korean Journal of Applied Statistics
    • /
    • v.34 no.2
    • /
    • pp.153-166
    • /
    • 2021
  • A main goal of pharmacogenomics studies is to predict individual's drug responsiveness based on high dimensional genetic variables. Due to a large number of variables, feature selection is required in order to reduce the number of variables. The selected features are used to construct a predictive model using machine learning algorithms. In the present study, we applied several hybrid feature selection methods such as combinations of logistic regression, ReliefF, TurF, random forest, and LASSO to a next generation sequencing data set of 400 epilepsy patients. We then applied the selected features to machine learning methods including random forest, gradient boosting, and support vector machine as well as a stacking ensemble method. Our results showed that the stacking model with a hybrid feature selection of random forest and ReliefF performs better than with other combinations of approaches. Based on a 5-fold cross validation partition, the mean test accuracy value of the best model was 0.727 and the mean test AUC value of the best model was 0.761. It also appeared that the stacking models outperform than single machine learning predictive models when using the same selected features.

Prediction on Busan's Gross Product and Employment of Major Industry with Logistic Regression and Machine Learning Model (로지스틱 회귀모형과 머신러닝 모형을 활용한 주요산업의 부산 지역총생산 및 고용 효과 예측)

  • Chae-Deug Yi
    • Korea Trade Review
    • /
    • v.47 no.2
    • /
    • pp.69-88
    • /
    • 2022
  • This paper aims to predict Busan's regional product and employment using the logistic regression models and machine learning models. The following are the main findings of the empirical analysis. First, the OLS regression model shows that the main industries such as electricity and electronics, machine and transport, and finance and insurance affect the Busan's income positively. Second, the binomial logistic regression models show that the Busan's strategic industries such as the future transport machinery, life-care, and smart marine industries contribute on the Busan's income in large order. Third, the multinomial logistic regression models show that the Korea's main industries such as the precise machinery, transport equipment, and machinery influence the Busan's economy positively. And Korea's exports and the depreciation can affect Busan's economy more positively at the higher employment level. Fourth, the voting ensemble model show the higher predictive power than artificial neural network model and support vector machine models. Furthermore, the gradient boosting model and the random forest show the higher predictive power than the voting model in large order.

Classifying the severity of pedestrian accidents using ensemble machine learning algorithms: A case study of Daejeon City (앙상블 학습기법을 활용한 보행자 교통사고 심각도 분류: 대전시 사례를 중심으로)

  • Kang, Heungsik;Noh, Myounggyu
    • Journal of Digital Convergence
    • /
    • v.20 no.5
    • /
    • pp.39-46
    • /
    • 2022
  • As the link between traffic accidents and social and economic losses has been confirmed, there is a growing interest in developing safety policies based on crash data and a need for countermeasures to reduce severe crash outcomes such as severe injuries and fatalities. In this study, we select Daejeon city where the relative proportion of fatal crashes is high, as a case study region and focus on the severity of pedestrian crashes. After a series of data manipulation process, we run machine learning algorithms for the optimal model selection and variable identification. Of nine algorithms applied, AdaBoost and Random Forest (ensemble based ones) outperform others in terms of performance metrics. Based on the results, we identify major influential factors (i.e., the age of pedestrian as 70s or 20s, pedestrian crossing) on pedestrian crashes in Daejeon, and suggest them as measures for reducing severe outcomes.

Asian Ethnic Group Classification Model Using Data Mining (데이터마이닝 방법을 이용한 아시아 민족 분류 모형 구축)

  • Kim, Yoon Geon;Lee, Ji Hyun;Cho, Sohee;Kim, Moon Young;Lee, Soong Deok;Ha, Eun Ho;Ahn, Jae Joon
    • The Korean Journal of Legal Medicine
    • /
    • v.41 no.2
    • /
    • pp.32-40
    • /
    • 2017
  • In addition to identifying genetic differences between target populations, it is also important to determine the impact of genetic differences with regard to the respective target populations. In recent years, there has been an increasing number of cases where this approach is needed, and thus various statistical methods must be considered. In this study, genetic data from populations of Southeast and Southwest Asia were collected, and several statistical approaches were evaluated on the Y-chromosome short tandem repeat data. In order to develop a more accurate and practical classification model, we applied gradient boosting and ensemble techniques. To infer between the Southeast and Southwest Asian populations, the overall performance of the classification models was better than that of the decision trees and regression models used in the past. In conclusion, this study suggests that additional statistical approaches, such as data mining techniques, could provide more useful interpretations for forensic analyses. These trials are expected to be the basis for further studies extending from target regions to the entire continent of Asia as well as the use of additional genes such as mitochondrial genes.

Prediction of compressive strength of sustainable concrete using machine learning tools

  • Lokesh Choudhary;Vaishali Sahu;Archanaa Dongre;Aman Garg
    • Computers and Concrete
    • /
    • v.33 no.2
    • /
    • pp.137-145
    • /
    • 2024
  • The technique of experimentally determining concrete's compressive strength for a given mix design is time-consuming and difficult. The goal of the current work is to propose a best working predictive model based on different machine learning algorithms such as Gradient Boosting Machine (GBM), Stacked Ensemble (SE), Distributed Random Forest (DRF), Extremely Randomized Trees (XRT), Generalized Linear Model (GLM), and Deep Learning (DL) that can forecast the compressive strength of ternary geopolymer concrete mix without carrying out any experimental procedure. A geopolymer mix uses supplementary cementitious materials obtained as industrial by-products instead of cement. The input variables used for assessing the best machine learning algorithm not only include individual ingredient quantities, but molarity of the alkali activator and age of testing as well. Myriad statistical parameters used to measure the effectiveness of the models in forecasting the compressive strength of ternary geopolymer concrete mix, it has been found that GBM performs better than all other algorithms. A sensitivity analysis carried out towards the end of the study suggests that GBM model predicts results close to the experimental conditions with an accuracy between 95.6 % to 98.2 % for testing and training datasets.

The effect of soil physical properties on predicting shear strength parameters based on comparing ensemble learning, deep learning, and support vector machine models

  • Ba-Quang-Vinh Nguyen;Yun-Tae Kim
    • Geomechanics and Engineering
    • /
    • v.39 no.3
    • /
    • pp.241-256
    • /
    • 2024
  • The shear strength (SS) of soil is a critical parameter utilized in the design of civil engineering projects. The SS parameters, including cohesion (c) and friction angle (𝜑), can be determined through methods conducted either in the field or within a laboratory environment. However, the traditional method for determining SS parameters are not only costly but also time-consuming. Recently, the application of machine learning (ML) in geotechnical problems has received increasing attention. In order to select an appropriate ML model and assess the effect of physical properties on the SS of soil. This research endeavors to predict critical SS parameters of soil through the application of five machine learning (ML) models, integrating easily-available physical soil index, including specific gravity (G), saturation degree (Sr), liquid limit (LL), silt content (SC), and clay content (CC). The used ML techniques include Extreme Gradient Boosting (XGBoost), Random Forest (RF), Multilayer Perceptron (MLP), Support Vector Machine (SVM), and Convolutional Neural Network (CNN). A range of metrics, encompassing the root mean square error (RMSE), mean absolute error (MAE), and determination coefficient (R2) were used to measure the predictive efficacy of the employed models as well as compare the performance of the used ML models. The values of R2 range from 0.769 to 0.987 indicate that all ML models exhibit excellent predictive capabilities for estimating SS parameters, in which the XGBoost, and CNN techniques show outperforming results compared to the other models. The study uses decision tree feature importance (DTFI) and coefficient feature importance (CFI) techniques to investigate how various physical properties impact the predictive capabilities of the model and indicates that both G and LL have a substantial impact on the predictive accuracy of cohesion and friction angle.

Vacant House Prediction and Important Features Exploration through Artificial Intelligence: In Case of Gunsan (인공지능 기반 빈집 추정 및 주요 특성 분석)

  • Lim, Gyoo Gun;Noh, Jong Hwa;Lee, Hyun Tae;Ahn, Jae Ik
    • Journal of Information Technology Services
    • /
    • v.21 no.3
    • /
    • pp.63-72
    • /
    • 2022
  • The extinction crisis of local cities, caused by a population density increase phenomenon in capital regions, directly causes the increase of vacant houses in local cities. According to population and housing census, Gunsan-si has continuously shown increasing trend of vacant houses during 2015 to 2019. In particular, since Gunsan-si is the city which suffers from doughnut effect and industrial decline, problems regrading to vacant house seems to exacerbate. This study aims to provide a foundation of a system which can predict and deal with the building that has high risk of becoming vacant house through implementing a data driven vacant house prediction machine learning model. Methodologically, this study analyzes three types of machine learning model by differing the data components. First model is trained based on building register, individual declared land value, house price and socioeconomic data and second model is trained with the same data as first model but with additional POI(Point of Interest) data. Finally, third model is trained with same data as the second model but with excluding water usage and electricity usage data. As a result, second model shows the best performance based on F1-score. Random Forest, Gradient Boosting Machine, XGBoost and LightGBM which are tree ensemble series, show the best performance as a whole. Additionally, the complexity of the model can be reduced through eliminating independent variables that have correlation coefficient between the variables and vacant house status lower than the 0.1 based on absolute value. Finally, this study suggests XGBoost and LightGBM based machine learning model, which can handle missing values, as final vacant house prediction model.