• Title/Abstract/Keywords: XGboost

Search Result 238, Processing Time 0.027 seconds

Analysis of Malware Group Classification with eXplainable Artificial Intelligence (XAI기반 악성코드 그룹분류 결과 해석 연구)

  • Kim, Do-yeon;Jeong, Ah-yeon;Lee, Tae-jin
    • Journal of the Korea Institute of Information Security & Cryptology / v.31 no.4 / pp.559-571 / 2021
  • Along with the increased prevalence of computers, the number of malware distributions by attackers to ordinary users has also increased. Research to detect malware continues to this day, and in recent years research has focused on malware detection and analysis using AI. However, AI algorithms have the disadvantage that they cannot explain why they detect and classify malware. XAI techniques have emerged to overcome this limitation of AI and make it practical. With XAI, it is possible to provide a basis for judgment on the final outcome of the AI. In this paper, we conducted malware group classification using XGBoost and Random Forest, and interpreted the results through SHAP. Both classification models showed a high classification accuracy of about 99%, and when the top 20 API features derived through XAI were compared with the main APIs of malware, the results could be interpreted and understood beyond a certain level. In the future, based on this, a direct AI reliability improvement study will be conducted.

Prediction Model of CNC Processing Defects Using Machine Learning (머신러닝을 이용한 CNC 가공 불량 발생 예측 모델)

  • Han, Yong Hee
    • Journal of the Korea Convergence Society / v.13 no.2 / pp.249-255 / 2022
  • This study proposed an analysis framework for real-time prediction of CNC processing defects using machine learning-based models, which have recently attracted attention as processing defect prediction methods, and applied it to CNC machines. The analysis shows that the XGBoost, CatBoost, and LightGBM models tie for the best accuracy, precision, recall, F1 score, and AUC, and that among them the LightGBM model has the shortest execution time. This short run time has practical advantages such as reducing actual system deployment costs, reducing the probability of CNC machine damage through rapid prediction of defects, and increasing overall CNC machine utilization, confirming that the LightGBM model is the most effective machine learning model for CNC machines with only basic sensors installed. In addition, it was confirmed that classification performance was maximized when an ensemble model consisting of LightGBM, ExtraTrees, k-Nearest Neighbors, and logistic regression models was applied in situations where there are no restrictions on execution time and computing power.

Prediction of Vertical Sea Water Temperature Profile in the East Sea Based on Machine Learning and XBT Data

  • Kim, Young-Joo;Lee, Soo-Jin;Kim, Young-Won
    • Journal of the Korea Society of Computer and Information / v.27 no.11 / pp.47-55 / 2022
  • Recently, research on the prediction of sea water temperature using artificial intelligence models has been actively conducted in Korea. However, most research on the seas around the Korean peninsula focuses on predicting sea surface temperatures. Unlike previous research, this study predicted the vertical sea water temperature profile of the East Sea, which is very important in submarine operations and anti-submarine warfare, using XBT (eXpendable Bathythermograph) data and machine learning models (RandomForest, XGBoost, LightGBM). The models were trained using XBT data measured from the sea surface to a depth of 200 m in a specific area of the East Sea, and the prediction accuracy was evaluated through MAE (Mean Absolute Error) and vertical sea water temperature profile graphs.

Vacant House Prediction and Important Features Exploration through Artificial Intelligence: In Case of Gunsan (인공지능 기반 빈집 추정 및 주요 특성 분석)

  • Lim, Gyoo Gun;Noh, Jong Hwa;Lee, Hyun Tae;Ahn, Jae Ik
    • Journal of Information Technology Services / v.21 no.3 / pp.63-72 / 2022
  • The extinction crisis of local cities, caused by population concentration in the capital region, directly increases the number of vacant houses in local cities. According to the population and housing census, Gunsan-si showed a continuously increasing trend in vacant houses from 2015 to 2019. In particular, since Gunsan-si suffers from the doughnut effect and industrial decline, problems regarding vacant houses seem likely to worsen. This study aims to provide a foundation for a system that can predict and deal with buildings at high risk of becoming vacant houses by implementing a data-driven machine learning model for vacant house prediction. Methodologically, this study analyzes three machine learning models that differ in their data components. The first model is trained on building register, individually declared land value, house price, and socioeconomic data; the second model is trained on the same data as the first model plus POI (Point of Interest) data. Finally, the third model is trained on the same data as the second model but excluding water usage and electricity usage data. As a result, the second model shows the best performance based on F1-score. Random Forest, Gradient Boosting Machine, XGBoost, and LightGBM, which are tree ensemble methods, show the best performance overall. Additionally, the complexity of the model can be reduced by eliminating independent variables whose correlation coefficient with vacant house status is below 0.1 in absolute value. Finally, this study suggests XGBoost- and LightGBM-based machine learning models, which can handle missing values, as the final vacant house prediction models.
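
The variable-elimination step mentioned in the abstract above (dropping independent variables whose correlation with vacant house status is below 0.1 in absolute value) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the feature names and toy values are hypothetical, and Pearson correlation is assumed because the abstract does not say which coefficient was used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def select_features(features, target, threshold=0.1):
    """Keep only features whose |correlation with target| is at least threshold."""
    return {
        name: values
        for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    }

# Hypothetical toy data: 1 = vacant, 0 = occupied.
vacant = [1, 0, 1, 0, 1, 0, 1, 0]
features = {
    "building_age": [40, 12, 35, 8, 50, 15, 45, 10],  # clearly related to the target
    "parcel_code": [1, 1, 2, 2, 1, 1, 2, 2],          # uncorrelated with the target
}
kept = select_features(features, vacant)
print(sorted(kept))
```

In this toy data only `building_age` survives the filter; a real pipeline would apply the same rule to every candidate variable before training.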

Comparison of Machine Learning Model Performance based on Observation Methods using Naked-eye and Visibility-meter (머신러닝을 이용한 안개 예측 시 목측과 시정계 계측 방법에 따른 모델 성능 차이 비교)

  • Changhyoun Park;Soon-hwan Lee
    • Journal of the Korean earth science society / v.44 no.2 / pp.105-118 / 2023
  • In this study, we predicted the presence of fog with a one-hour delay using the XGBoost DART machine learning algorithm for Andong, which had the highest occurrence of fog among inland stations from 2016 to 2020. We used six datasets: meteorological data, agricultural observation data, additional derived data, and their expanded data. The weather phenomenon numbers obtained through naked-eye observations and the visibility distances measured by visibility meters were classified as fog [1] or no-fog [0]. We set up twelve machine learning modeling experiments and used data from 2021 for model validation. We mainly evaluated model performance using recall and AUC-ROC, considering the harmful effects of fog on society and local communities. The combination of oversampled meteorological data features and the target induced by weather phenomenon numbers showed the best performance. This result highlights the importance of naked-eye observations in predicting fog using machine learning algorithms.
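
As a rough sketch of the target construction described above, the snippet below labels hourly visibility readings as fog [1] or no-fog [0] and pairs each hour's features with the label one hour later. Several things here are assumptions rather than details from the abstract: the 1000 m threshold (a common meteorological convention; the abstract states no threshold), the reading of "one-hour delay" as a one-hour-ahead target, and the hypothetical humidity feature.

```python
FOG_VISIBILITY_M = 1000  # assumed threshold; not stated in the abstract

def label_fog(visibility_m):
    """Return 1 (fog) where visibility is below the threshold, else 0 (no-fog)."""
    return [1 if v < FOG_VISIBILITY_M else 0 for v in visibility_m]

def make_lagged_pairs(feature_rows, labels, lead_hours=1):
    """Pair the features at hour t with the fog label at hour t + lead_hours."""
    if lead_hours < 1:
        raise ValueError("lead_hours must be at least 1")
    return list(zip(feature_rows[:-lead_hours], labels[lead_hours:]))

# Hypothetical hourly observations.
visibility = [5000, 1800, 900, 400, 2500]  # metres
humidity = [70, 88, 95, 97, 80]            # percent, stand-in feature

labels = label_fog(visibility)
pairs = make_lagged_pairs(humidity, labels)
print(labels)
print(pairs)
```

The last hour's features are dropped because no label exists one hour after them.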

Personalized Diabetes Risk Assessment Through Multifaceted Analysis (PD-RAMA): A Novel Machine Learning Approach to Early Detection and Management of Type 2 Diabetes

  • Gharbi Alshammari
    • International Journal of Computer Science & Network Security / v.23 no.8 / pp.17-25 / 2023
  • The alarming global prevalence of Type 2 Diabetes Mellitus (T2DM) has catalyzed an urgent need for robust, early diagnostic methodologies. This study unveils a pioneering approach to predicting T2DM, employing the Extreme Gradient Boosting (XGBoost) algorithm, renowned for its predictive accuracy and computational efficiency. The investigation harnesses a meticulously curated dataset of 4303 samples, extracted from a comprehensive Chinese research study, scrupulously aligned with the World Health Organization's indicators and standards. The dataset encapsulates a multifaceted spectrum of clinical, demographic, and lifestyle attributes. Through an intricate process of hyperparameter optimization, the XGBoost model exhibited an unparalleled best score, elucidating a distinctive combination of parameters such as a learning rate of 0.1, max depth of 3, 150 estimators, and specific colsample strategies. The model's validation accuracy of 0.957, coupled with a sensitivity of 0.9898 and specificity of 0.8897, underlines its robustness in classifying T2DM. A detailed analysis of the confusion matrix further substantiated the model's diagnostic prowess, with an F1-score of 0.9308, illustrating its balanced performance in true positive and negative classifications. The precision and recall metrics provided nuanced insights into the model's ability to minimize false predictions, thereby enhancing its clinical applicability. The research findings not only underline the remarkable efficacy of XGBoost in T2DM prediction but also contribute to the burgeoning field of machine learning applications in personalized healthcare. By elucidating a novel paradigm that accentuates the synergistic integration of multifaceted clinical parameters, this study fosters a promising avenue for precise early detection, risk stratification, and patient-centric intervention in diabetes care. The research serves as a beacon, inspiring further exploration and innovation in leveraging advanced analytical techniques for transformative impacts on predictive diagnostics and chronic disease management.
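
For readers unfamiliar with the metrics quoted in the abstract above (accuracy, sensitivity, specificity, precision, F1-score), the following sketch shows how each is derived from the four cells of a binary confusion matrix. The counts are hypothetical and chosen only for illustration; they are not the study's confusion matrix and do not reproduce its reported values.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall: share of actual positives detected
    specificity = tn / (tn + fp)   # share of actual negatives detected
    precision = tp / (tp + fp)     # share of predicted positives that are correct
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because F1 is the harmonic mean of precision and sensitivity, it ignores true negatives entirely, which is why specificity is usually reported alongside it.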

A Study on Predictive Modeling of I-131 Radioactivity Based on Machine Learning (머신러닝 기반 고용량 I-131의 용량 예측 모델에 관한 연구)

  • Yeon-Wook You;Chung-Wun Lee;Jung-Soo Kim
    • Journal of radiological science and technology / v.46 no.2 / pp.131-139 / 2023
  • High-dose I-131 used for the treatment of thyroid cancer causes localized exposure among radiology technologists handling it. There is a delay between the calibration date and when the dose of I-131 is administered to a patient. Therefore, it is necessary to directly measure the radioactivity of the administered dose using a dose calibrator. In this study, we attempted to apply machine learning modeling to measured external dose rates from shielded I-131 in order to predict their radioactivity. External dose rates were measured at distances of 1 m, 0.3 m, and 0.1 m from a shielded container holding the I-131, with a total of 868 sets of measurements taken. For the modeling process, we used the hold-out method to partition the data at a 7:3 ratio (609 for the training set, 259 for the test set). For the machine learning algorithms, we chose linear regression, decision tree, random forest, and XGBoost. To evaluate the models, we calculated root mean square error (RMSE), mean square error (MSE), and mean absolute error (MAE) for accuracy and R² for explanatory power. The evaluation results are as follows: linear regression (RMSE 268.15, MSE 71901.87, MAE 231.68, R² 0.92), decision tree (RMSE 108.89, MSE 11856.92, MAE 19.24, R² 0.99), random forest (RMSE 8.89, MSE 79.10, MAE 6.55, R² 0.99), XGBoost (RMSE 10.21, MSE 104.22, MAE 7.68, R² 0.99). The random forest model achieved the highest predictive ability. Improving the model's performance in the future is expected to contribute to lowering exposure among radiology technologists.
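
The four evaluation measures named in the abstract above can be computed as in the sketch below. It is a plain-Python illustration with hypothetical radioactivity values, not the study's data or code. Note that RMSE is simply the square root of MSE, which is consistent with the reported figures (for example, the random forest's RMSE of 8.89 is approximately the square root of its MSE of 79.10).

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }

# Hypothetical measured vs. predicted radioactivity values.
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 195.0, 255.0]
scores = regression_metrics(y_true, y_pred)
print(scores)
```

MAE weights all errors equally, while MSE and RMSE penalize large errors more heavily, so a model with a few large misses can show a low MAE next to a high RMSE.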

Comparison of Chlorophyll-a Prediction and Analysis of Influential Factors in Yeongsan River Using Machine Learning and Deep Learning (머신러닝과 딥러닝을 이용한 영산강의 Chlorophyll-a 예측 성능 비교 및 변화 요인 분석)

  • Sun-Hee, Shim;Yu-Heun, Kim;Hye Won, Lee;Min, Kim;Jung Hyun, Choi
    • Journal of Korean Society on Water Environment / v.38 no.6 / pp.292-305 / 2022
  • The Yeongsan River, one of the four largest rivers in South Korea, has been facing difficulties with water quality management with respect to algal bloom. The algal bloom problem has grown, especially after the construction of two weirs in the mainstream of the Yeongsan River. Therefore, prediction and factor analysis of Chlorophyll-a (Chl-a) concentration are needed for effective water quality management. In this study, Chl-a prediction models were developed and their performance was evaluated using machine and deep learning methods, namely Deep Neural Network (DNN), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost). Moreover, the correlation analysis and the feature importance results were compared to identify the major factors affecting the concentration of Chl-a. All models showed high prediction performance with an R² value of 0.9 or higher. In particular, XGBoost showed the highest prediction accuracy of 0.95 on the test data. The feature importance results suggested that Ammonia (NH3-N) and Phosphate (PO4-P) were major factors common to the three models for managing Chl-a concentration. From the results, it was confirmed that the three methods, DNN, RF, and XGBoost, are powerful methods for predicting water quality parameters. Also, comparing feature importance with correlation analysis would give a more accurate assessment of the major factors.

Prediction of Semiconductor Exposure Process Measurement Results using XGBoost (XGBoost를 사용한 반도체 노광 공정 계측 결과 예측)

  • Shin, Jeong Il;Park, Ji Su;Shon, Jin Gon
    • Proceedings of the Korea Information Processing Society Conference / 2021.05a / pp.505-508 / 2021
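  • As semiconductor circuits shrink and the number of unit processes grows, manufacturing costs rise with the increase in TAT (turn-around time). Among semiconductor processes, the photo process transfers the circuit pattern on a mask to the wafer, and the accuracy of the circuit is determined by the performance of the exposure equipment that performs the transfer. The measurement process that verifies this accuracy becomes more necessary as circuits shrink, but it is a major cause of increased TAT; accordingly, with the recent development of various machine learning-based prediction models, experiments on predicting measurement results are under way. To address the poor classification performance of the LFDC (Lithography Fault Detection and Classification) system, which detects and classifies abnormal values from exposure equipment sensors and then proceeds to the measurement process, this paper experimented with a method that uses XGBoost to classify abnormal sensor values with a trained learner, without performing the measurement process, and then either reruns the photo process or proceeds to the next process. The measurement result prediction model used in the experiment achieved 89% accuracy, and obtained the same accuracy on severely imbalanced data, which is characteristic of semiconductor data. In these results, 89% of the abnormal sensor values were judged normal, and actual measurement of the wafers judged normal gave the same results as predicted. Using the measurement result prediction model, abnormal sensor values can be judged without performing actual measurement, so the shortened TAT can reduce manufacturing costs, reduce the load on measurement equipment, and improve efficiency. However, because the measurement result prediction model in this paper performs at about 90%, actual measurement is still needed for the remaining 10%, and further research on this issue is needed.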

Development of a Resignation Prediction Model using HR Data (HR 데이터 기반의 퇴사 예측 모델 개발)

  • PARK, YUNJUNG;Lee, Do-Gil
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2021.05a / pp.100-103 / 2021
  • Most companies study why employees resign in order to prevent the outflow of excellent human resources. To obtain the data needed for such studies, employees are interviewed or surveyed before resignation. However, it is difficult to get accurate results because employees do not want to express, in a survey, opinions that may be disadvantageous to them at work. Meanwhile, according to data released by the Korea Labor Institute, the greater the difference between the minimum level of education required by companies and employees' actual academic background, the greater the tendency to resign. Therefore, in this study, we aim to predict whether employees will leave the company based on data such as major, education level, and company type. We generated four kinds of resignation prediction models using Decision Tree, XGBoost, kNN, and SVM, and compared their respective performance. As a result, we could identify various factors that were not covered in previous studies. The resignation prediction model is expected to help companies recognize in advance employees who intend to leave.
