• Title/Summary/Keyword: Light GBM

Store Sales Prediction Using Gradient Boosting Model (그래디언트 부스팅 모델을 활용한 상점 매출 예측)

  • Choi, Jaeyoung;Yang, Heeyoon;Oh, Hayoung
    • Journal of the Korea Institute of Information and Communication Engineering, v.25 no.2, pp.171-177, 2021
  • With the rapid development of machine learning, it is being applied in diverse ways not only in industry but also in daily life. Applications of machine learning to financial data have also attracted interest. Herein, we apply machine learning algorithms to store sales data and present future applications for fintech enterprises. We use several missing-data processing methods and apply gradient boosting algorithms (XGBoost, LightGBM, and CatBoost) to predict the future revenue of individual stores. We found that median imputation of missing data combined with the XGBoost algorithm gives the best accuracy. Both fintech enterprises and their customers can benefit from the proposed method: stores can receive financial assistance in advance from fintech companies, while these companies can offer financial support to stores at low risk.
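
As a rough illustration of the median imputation step mentioned in this abstract, the following standard-library sketch fills missing entries with the median of the observed ones. The function name `impute_median` and the sample sales figures are hypothetical; the paper's actual preprocessing pipeline is not described here.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical monthly sales for one store; None marks a missing month.
monthly_sales = [120.0, None, 95.0, 110.0, None, 130.0]
imputed = impute_median(monthly_sales)
print(imputed)
```

The imputed column could then be passed to any gradient boosting regressor. The median is less sensitive to outlier months than the mean, which may be one reason it performed well in this setting.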

A Study on Fraud Detection in the C2C Used Trade Market Using Doc2vec

  • Lim, Do Hyun;Ahn, Hyunchul
    • Journal of the Korea Society of Computer and Information, v.27 no.3, pp.173-182, 2022
  • In this paper, we propose a machine learning model that can prevent fraudulent transactions in advance and interpret them using an XAI approach. For the experiment, we collected a real data set of 12,258 mobile phone sales posts from Joonggonara, a major domestic online C2C resale trading platform. Features of the post body text were extracted using Doc2vec, dimensionality was reduced through PCA, and various derived variables were created based on previous research. To mitigate the data imbalance problem in the preprocessing stage, a combined sampling method that uses both oversampling and undersampling was applied. Then, various machine learning models were built to detect fraudulent postings. In the analysis, LightGBM showed the best performance among the machine learning models. The SHAP analysis showed that a post was highly likely to be fraudulent if the price was unreasonably low compared to the market price and if no transaction area was indicated. A high price, the absence of a safe-transaction option, a higher share of courier transactions, and a higher ratio of zeros in the price were also associated with fraud.
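
The abstract does not say which oversampling and undersampling techniques were combined. As a minimal sketch of the general idea, the following code uses plain random resampling (an assumption, swapped in for illustration): the majority class is undersampled without replacement and the minority class is oversampled with replacement until both reach the same size. The function `resample_balanced` and the post labels are hypothetical.

```python
import random


def resample_balanced(majority, minority, target, seed=0):
    """Bring both classes to `target` items each.

    The majority class is randomly undersampled without replacement;
    the minority class keeps all its items and is randomly oversampled
    with replacement to make up the difference.
    """
    if not len(minority) <= target <= len(majority):
        raise ValueError("target must lie between the two class sizes")
    rng = random.Random(seed)
    under = rng.sample(majority, target)
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    over = list(minority) + extra
    return under, over


# Hypothetical post identifiers: 20 legitimate posts, 4 fraudulent ones.
legit_posts = [f"legit_{i}" for i in range(20)]
fraud_posts = [f"fraud_{i}" for i in range(4)]
under, over = resample_balanced(legit_posts, fraud_posts, target=10)
print(len(under), len(over))
```

Resampling of this kind should be applied to the training split only, so that evaluation still reflects the original class ratio.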

Prediction of Vertical Sea Water Temperature Profile in the East Sea Based on Machine Learning and XBT Data

  • Kim, Young-Joo;Lee, Soo-Jin;Kim, Young-Won
    • Journal of the Korea Society of Computer and Information, v.27 no.11, pp.47-55, 2022
  • Recently, research on predicting sea water temperature using artificial intelligence models has been actively conducted in Korea. However, most studies of the seas around the Korean peninsula focus on predicting sea surface temperature. Unlike previous studies, this study predicted the vertical sea water temperature profile of the East Sea, which is very important in submarine operations and anti-submarine warfare, using XBT (eXpendable Bathythermograph) data and machine learning models (RandomForest, XGBoost, LightGBM). The models were trained using XBT data measured from the sea surface to a depth of 200 m in a specific area of the East Sea, and prediction accuracy was evaluated using MAE (Mean Absolute Error) and vertical sea water temperature profile graphs.
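
For reference, MAE over a vertical profile is the average absolute difference between observed and predicted temperature at each depth. The sketch below shows the calculation; the two profiles are made-up values, not data from the paper.

```python
def mean_absolute_error(observed, predicted):
    """Average of |observed - predicted| over paired samples."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(o - p) for o, p in zip(observed, predicted))
    return total / len(observed)


# Hypothetical temperatures (degrees C) at five depths, surface first.
observed_profile = [18.0, 15.5, 10.0, 6.5, 4.0]
predicted_profile = [17.5, 16.0, 9.0, 7.0, 4.5]
mae = mean_absolute_error(observed_profile, predicted_profile)
print(mae)
```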

Vacant House Prediction and Important Features Exploration through Artificial Intelligence: In Case of Gunsan (인공지능 기반 빈집 추정 및 주요 특성 분석)

  • Lim, Gyoo Gun;Noh, Jong Hwa;Lee, Hyun Tae;Ahn, Jae Ik
    • Journal of Information Technology Services, v.21 no.3, pp.63-72, 2022
  • The extinction crisis of local cities, caused by increasing population density in the capital region, directly increases the number of vacant houses in local cities. According to the population and housing census, Gunsan-si showed a continuously increasing trend in vacant houses from 2015 to 2019. In particular, because Gunsan-si suffers from the doughnut effect and industrial decline, problems regarding vacant houses seem to be worsening. This study aims to provide the foundation for a system that can predict and deal with buildings at high risk of becoming vacant houses by implementing a data-driven vacant house prediction machine learning model. Methodologically, this study analyzes three types of machine learning model that differ in their data components. The first model is trained on building register, individual declared land value, house price, and socioeconomic data; the second model is trained on the same data as the first plus POI (Point of Interest) data. Finally, the third model is trained on the same data as the second but excluding water usage and electricity usage data. As a result, the second model shows the best performance based on F1-score. Random Forest, Gradient Boosting Machine, XGBoost, and LightGBM, which are tree ensemble models, show the best performance overall. Additionally, the complexity of the model can be reduced by eliminating independent variables whose correlation coefficient with vacant house status is below 0.1 in absolute value. Finally, this study suggests machine learning models based on XGBoost and LightGBM, which can handle missing values, as the final vacant house prediction model.
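
The correlation-based variable elimination described in this abstract can be sketched as follows, assuming the Pearson coefficient is meant (the abstract does not name the coefficient). The feature names and values are hypothetical; only the 0.1 absolute-value threshold comes from the abstract.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def select_features(features, target, threshold=0.1):
    """Keep features whose |correlation| with the target reaches the threshold."""
    return [name for name, column in features.items()
            if abs(pearson(column, target)) >= threshold]


# Hypothetical data: 1 = vacant, 0 = occupied.
vacant = [1, 1, 0, 0, 1, 0]
features = {
    "building_age": [40, 35, 10, 12, 38, 8],   # strongly positive
    "water_usage": [2, 3, 20, 18, 1, 22],      # strongly negative
    "unrelated": [1, 2, 1, 2, 3, 3],           # zero correlation
}
kept = select_features(features, vacant)
print(kept)
```

Because the filter uses the absolute value, a strongly negative predictor such as the hypothetical `water_usage` column is retained alongside positive ones.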

Data Quality Assessment and Improvement for Water Level Prediction of the Han River (한강 수위 예측을 위한 데이터 품질 진단 및 개선)

  • Ji-Hyun Choi;Jin-Yeop Kang;Hyun Ahn
    • Journal of Advanced Navigation Technology, v.27 no.1, pp.133-138, 2023
  • As a side effect of recent rapid climate change and global warming, the frequency and scale of flood disasters are increasing worldwide. In Korea, the water level of the Han River is a major management target for preventing flood disasters in Seoul, the capital. In this paper, to improve machine learning-based water level prediction for the Han River, we perform a comprehensive assessment of the quality of the related dataset and propose data preprocessing methods to improve it. Specifically, we improve the dataset in terms of completeness, validity, and accuracy through missing value processing and cross-correlation analysis. In addition, we conduct a performance evaluation using random forest and LightGBM to analyze the effect of the proposed data improvement method on water level prediction performance for the Han River.

A Study on Predicting Student Dropout in College: The Importance of Early Academic Performance (전문대학 학생의 학업중단 예측에 관한 연구: 초기 학업 성적의 중요성)

  • Sangjo Oh;JiHwan Sim
    • Journal of Industrial Convergence, v.22 no.2, pp.23-32, 2024
  • This study used a minimal number of demographic variables and students' first-semester GPA to predict the final academic status of students at a vocational college in Seoul. The results from XGBoost and LightGBM models revealed that these variables significantly impacted the prediction of students' dismissal. This suggests that early academic performance could be an important indicator of potential academic dropout. Additionally, the study confirmed the possibility that the number of academic years required to earn an associate degree at the vocational college could influence the final academic status, indicating that the duration of study is a crucial factor in students' decisions to discontinue their studies. The study attempted to model dropout without relying on psychological, social, or economic factors, focusing solely on academic achievement. This is expected to aid the development of an early warning system for preventing academic dropout in the future.

A Study on the Prediction Model for Analysis of Water Quality in Gwangju Stream using Machine Learning Algorithm (머신러닝 학습 알고리즘을 이용한 광주천 수질 분석에 대한 예측 모델 연구)

  • Yu-Jeong Jeong;Jung-Jae Lee
    • The Journal of the Korea Institute of Electronic Communication Sciences, v.19 no.3, pp.531-538, 2024
  • As the importance of the water quality environment is increasingly emphasized, the water quality index used to improve the water quality of urban rivers in Gwangju Metropolitan City is an important factor affecting the aquatic ecosystem and requires accurate prediction. In this paper, the XGBoost and LightGBM machine learning algorithms were used to compare performance on the water quality inspection items of the downstream Pyeongchon Bridge and upstream BanghakBr_Gwangjucheon1 water systems, which are important points of Gwangju Stream. Following statistical verification, three water quality indicators, total nitrogen (TN), nitrate (NO3), and ammonia (NH3), were predicted, and the performance of the predictive model was evaluated using RMSE, an evaluation index for regression models. When individual models were implemented for each water system and their performance was compared after cross-validation, the XGBoost model showed excellent predictive ability.
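
RMSE, the evaluation index named in this abstract, is the square root of the mean squared prediction error. A minimal sketch with made-up concentration values (not taken from the paper):

```python
from math import sqrt


def root_mean_squared_error(observed, predicted):
    """Square root of the mean of squared differences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical total nitrogen readings (mg/L) and model predictions.
observed_tn = [2.0, 3.0, 4.0, 5.0]
predicted_tn = [2.5, 2.5, 4.5, 4.5]
rmse = root_mean_squared_error(observed_tn, predicted_tn)
print(rmse)
```

Unlike MAE, RMSE squares each error before averaging, so a few large misses raise it more than many small ones.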

Model Interpretation through LIME and SHAP Model Sharing (LIME과 SHAP 모델 공유에 의한 모델 해석)

  • Yong-Gil Kim
    • The Journal of the Institute of Internet, Broadcasting and Communication, v.24 no.2, pp.177-184, 2024
  • As data volumes grow rapidly, we use all kinds of complex ensemble and deep learning algorithms to achieve the highest accuracy. It is sometimes unclear how these models predict, classify, recognize, and track unknown data. Answering this question has been, and will remain, a goal of intensive research and development in the data science community. A variety of factors, such as lack of data, imbalanced data, and biased data, can affect the decisions rendered by learning models. Many methods are gaining traction for such interpretation. LIME and SHAP, two state-of-the-art open-source explainability techniques, are now commonly used. However, their outputs can differ. In this context, this study introduces a technique that couples LIME and SHAP, and demonstrates how it can be used to analyze the decisions made by LightGBM and Keras models when classifying transactions as fraudulent on the IEEE CIS dataset.

Study on Predicting the Designation of Administrative Issue in the KOSDAQ Market Based on Machine Learning Based on Financial Data (머신러닝 기반 KOSDAQ 시장의 관리종목 지정 예측 연구: 재무적 데이터를 중심으로)

  • Yoon, Yanghyun;Kim, Taekyung;Kim, Suyeong
    • Asia-Pacific Journal of Business Venturing and Entrepreneurship, v.17 no.1, pp.229-249, 2022
  • This paper investigates machine learning models for predicting the designation of administrative issues in the KOSDAQ market using various techniques. When a company in the Korean stock market is designated as an administrative issue, the market treats the event itself as negative information, causing losses to the company and investors. The purpose of this study is to evaluate alternative methods for developing an artificial intelligence service that examines, at an early stage, the possibility of designation as an administrative issue through companies' financial ratios and helps investors manage portfolio risk. In this study, 21 financial ratios representing profitability, stability, activity, and growth were used as independent variables. Financial data of companies with administrative-issue and non-administrative-issue stocks were sampled from 2011 to 2020, the period in which K-IFRS was applied. Logistic regression analysis, decision tree, support vector machine, random forest, and LightGBM were used to predict the designation of administrative issues. According to the results, LightGBM is the best prediction model with 82.73% classification accuracy, and the decision tree has the lowest classification accuracy at 71.94%. An examination of the top three variables by importance in the decision tree-based learning models shows that the financial variables common to each model are ROE (net profit) and capital stock turnover ratio, which are relatively important variables in designating administrative issues. In general, the ensemble learning models had higher predictive performance than the single learning models.

Radio Frequency-based Drone Detection and Classification Using Discrete Fourier Transform and LightGBM

  • Ki-Hyeon Sung;Soo-Jin Lee
    • Journal of the Korea Society of Computer and Information, v.29 no.10, pp.59-68, 2024
  • In this study, we propose an efficient model that can detect and classify drones and related devices based on radio frequency signals. To increase its applicability on the battlefield, the proposed model was designed to be lightweight so as to ensure rapid detection and high detection accuracy. Data preprocessing was performed by applying the Discrete Fourier Transform (DFT), which is faster than the Hilbert-Huang Transform (HHT). We adopted LightGBM as the learning model because it can be easily used by non-professionals and offers excellent performance in terms of classification speed and accuracy. The CardRF dataset was used to verify the performance of the proposed model. In the experiment, the accuracy of the 3-class classification for detecting and classifying drones, WiFi devices, and Bluetooth devices was 99.63% when the number of sample points was set to 100k and 99.40% when it was set to 500k during data preprocessing with the DFT. In the 10-class classification of 6 drones, 2 Bluetooth devices, and 2 WiFi devices, the accuracy was 95.65% for 100k and 96.83% for 500k, confirming significantly improved detection performance compared to previous studies.
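
To illustrate the kind of DFT preprocessing this abstract refers to, the sketch below computes the magnitude spectrum of a short signal directly from the DFT definition. It is a naive O(n²) implementation for clarity only; with 100k or 500k sample points a fast Fourier transform routine would be used in practice. How the paper turns the spectrum into LightGBM input features is not stated here, so using the magnitudes as features is an assumption.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude of each DFT bin, computed directly from the definition."""
    n = len(samples)
    magnitudes = []
    for k in range(n):
        # X[k] = sum over t of x[t] * exp(-2*pi*i*k*t/n)
        coefficient = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(samples)
        )
        magnitudes.append(abs(coefficient))
    return magnitudes


# Test signal: a cosine completing exactly 2 cycles over 8 samples.
signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
mags = dft_magnitudes(signal)
print([round(m, 6) for m in mags])
```

For this test signal the energy concentrates in bin 2 and its mirror bin 6, each with magnitude n/2 = 4, while the remaining bins are numerically zero.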