• Title/Summary/Keyword: tree-based models

Using Data Mining Techniques to Predict Win-Loss in Korean Professional Baseball Games (데이터마이닝을 활용한 한국프로야구 승패예측모형 수립에 관한 연구)

  • Oh, Younhak;Kim, Han;Yun, Jaesub;Lee, Jong-Seok
    • Journal of Korean Institute of Industrial Engineers / v.40 no.1 / pp.8-17 / 2014
  • In this research, we employed various data mining techniques to build predictive models for win-loss prediction in Korean professional baseball games. The historical data on players and teams was obtained from the official materials provided on the KBO website. From the collected raw data, we prepared two additional datasets, one in ratio format and one in binary format. The ratio dataset was generated by dividing the away team's records by the corresponding home team's records, while the binary dataset was obtained by comparing the record values. We applied seven classification techniques to the three (raw, ratio, and binary) datasets: decision tree, random forest, logistic regression, neural network, support vector machine, linear discriminant analysis, and quadratic discriminant analysis. Among the 21 (= 3 datasets $\times$ 7 techniques) prediction scenarios, the most accurate model was the random forest trained on the binary dataset, with a prediction accuracy of 84.14%. We also observed that the ratio and binary datasets produced better prediction models than the raw data. From the variable selection performed by decision tree, random forest, and stepwise logistic regression, we found that annual salary, earned runs, strikeouts, pitcher's winning percentage, and walks (four balls) are important winning factors in a game. This research is distinct from existing studies in that it used three different types of data and various data mining techniques for win-loss prediction in Korean professional baseball games.
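The ratio and binary dataset construction described in this abstract can be illustrated with a minimal sketch. The statistic names and values below are hypothetical, and both the direction of the comparison (1 when the away team's value is larger) and the handling of ties are assumptions, since the abstract does not specify them.

```python
def make_ratio_and_binary(away, home):
    """Derive ratio and binary features from paired team records.

    Assumes every home-team value is nonzero.
    """
    ratio, binary = {}, {}
    for stat, away_value in away.items():
        home_value = home[stat]
        # Ratio dataset: away-team record divided by home-team record.
        ratio[stat] = away_value / home_value
        # Binary dataset: 1 if the away team's value is larger, else 0
        # (comparison direction and tie handling are assumptions).
        binary[stat] = 1 if away_value > home_value else 0
    return ratio, binary


# Hypothetical season records for one game's two teams.
away_team = {"earned_runs": 450.0, "strikeouts": 900.0}
home_team = {"earned_runs": 500.0, "strikeouts": 750.0}

ratio_row, binary_row = make_ratio_and_binary(away_team, home_team)
print(ratio_row)
print(binary_row)
```

Each game then contributes one row per dataset type, so the same classifier can be trained on the raw, ratio, and binary versions and compared.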

Automatic Test Case Generation Through 1-to-1 Requirement Modeling (1대1 요구사항 모델링을 통한 테스트 케이스 자동 생성)

  • Oh, Jung-Sup;Choi, Kyung-Hee;Jung, Gi-Hyun
    • The KIPS Transactions:PartD / v.17D no.1 / pp.41-52 / 2010
  • The relation between generated test cases and the original requirements is important, but in model-based automatic test case generation it becomes very complex because the relation between requirement models and requirements is m-to-n. This paper proposes an automatic test case generation technique for REED (REquirement EDitor), a 1-to-1 requirement modeling tool. Test cases are generated through three steps: Coverage Target Generation, IORT (Input Output Relation Tree) Generation, and Test Case Generation. All of these steps run automatically. The test cases can be generated from a single requirement. Applying the technique to three real commercial systems produced 5566 test cases for the Temperature Controller, 3757 test cases for the Bus Card Terminal, and 4611 test cases for the Excavator Controller.

Analysis of the Factors and Patterns Associated with Death in Aircraft Accidents and Incidents Using Data Mining Techniques (데이터 마이닝 기법을 활용한 항공기 사고 및 준사고로 인한 사망 발생 요인 및 패턴 분석)

  • Kim, Jeong-Hun;Kim, Tae-Un;Yoo, Dong-Hee
    • Journal of Digital Convergence / v.17 no.9 / pp.79-88 / 2019
  • This study analyzes the influential factors and patterns associated with death in aircraft accidents and incidents using data mining techniques. To this end, we used two datasets on aircraft accidents and incidents, one from the National Transportation Safety Board (NTSB) and the other from the Federal Aviation Administration (FAA). We built prediction models with a decision tree classifier to predict death in aircraft accidents or incidents, and from these models we derived the main causal factors and patterns associated with death. In the NTSB data, deaths occurred frequently when the aircraft was destroyed or when people were performing dangerous missions or maneuvers. In the FAA data, deaths were mainly caused by less skilled or less qualified pilots when their aircraft were partially destroyed. Several death-related patterns were also found for parachute jumping and for the ascending and descending phases of flight. Using the derived patterns, we proposed strategies to help prevent death in aircraft accidents and incidents.

Comparative Analysis of the Binary Classification Model for Improving PM10 Prediction Performance (PM10 예측 성능 향상을 위한 이진 분류 모델 비교 분석)

  • Jung, Yong-Jin;Lee, Jong-Sung;Oh, Chang-Heon
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.1 / pp.56-62 / 2021
  • As public concern over particulate matter grows, highly accurate forecasts are required. Many attempts are therefore being made to improve the accuracy of particulate matter prediction with machine learning. However, because the concentration distribution is imbalanced and particulate matter has varied characteristics, prediction models are not trained well. To address these problems, this paper proposes a binary classification model that divides the particulate matter concentration into two classes at a threshold of 80㎍/㎥. Four classification algorithms were used for the binary classification of PM10: logistic regression, decision tree, SVM, and MLP. In a performance evaluation based on the confusion matrix, the MLP model showed the highest binary classification performance of the four models, with 89.98% accuracy.
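The two-class labeling at 80㎍/㎥ and the confusion-matrix evaluation mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the concentration values and predictions are made up, and assigning a value of exactly 80 to the lower class is an assumption.

```python
THRESHOLD = 80.0  # class boundary in micrograms per cubic metre, from the abstract


def to_binary_labels(concentrations, threshold=THRESHOLD):
    """Label 1 for concentrations above the threshold, else 0.

    Placing the boundary value itself in class 0 is an assumption.
    """
    return [1 if c > threshold else 0 for c in concentrations]


def confusion_matrix(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(cm):
    """Fraction of correct predictions: (TP + TN) / total."""
    return (cm["tp"] + cm["tn"]) / sum(cm.values())


# Hypothetical observed PM10 values and hypothetical classifier output.
observed = [35.0, 82.5, 120.0, 60.0, 79.9, 95.0]
labels = to_binary_labels(observed)
predicted = [0, 1, 0, 0, 1, 1]

cm = confusion_matrix(labels, predicted)
print(labels, cm, accuracy(cm))
```

With an imbalanced class distribution, accuracy alone can be misleading, so the individual confusion-matrix cells are worth inspecting alongside it.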

Comparison of the Machine Learning Models Predicting Lithium-ion Battery Capacity for Remaining Useful Life Estimation (리튬이온 배터리 수명추정을 위한 용량예측 머신러닝 모델의 성능 비교)

  • Yoo, Sangwoo;Shin, Yongbeom;Shin, Dongil
    • Journal of the Korean Institute of Gas / v.24 no.6 / pp.91-97 / 2020
  • Lithium-ion batteries (LIBs) have a longer lifespan, higher energy density, and lower self-discharge rate than other batteries, so they are preferred for Energy Storage Systems (ESS). However, 28 ESS fire accidents occurred in Korea between 2017 and 2019, and accurate LIB capacity estimation is essential to ensure safety and reliability during operation. In this study, data-driven models that predict capacity changes over the charging cycles of an LIB were developed, and their performance was compared to select the optimal machine learning model. The candidates were Decision Tree, Ensemble Learning Method, Support Vector Regression, and Gaussian Process Regression (GPR). Lithium battery test data provided by NASA was used for model training, and GPR showed the best prediction performance. Building on this study, we will develop an enhanced model for LIB capacity prediction and remaining useful life estimation through additional data training, and improve anomaly detection and monitoring during operation, enabling safe and stable ESS operation.

A Study on the Big Data Analysis and Predictive Models for Quality Issues in Defense C5ISR (국방 C5ISR 분야 품질문제의 빅데이터 분석 및 예측 모델에 대한 연구)

  • Hyoung Jo Huh;Sujin Ko;Seung Hyun Baek
    • Journal of Korean Society for Quality Management / v.51 no.4 / pp.551-571 / 2023
  • Purpose: The purpose of this study is to propose useful suggestions by analyzing the cause-and-effect relationship between the quality failure rate and the process variables in the C5ISR domain of the defense industry. Methods: Data collected through in-house systems were analyzed using big data analysis. The analysis of quality data and A/S history data followed the CRISP-DM (Cross-Industry Standard Process for Data Mining) process. Results: After evaluating candidate models for the influence of inspection data and A/S history data, logistic regression was selected as the final model because it performed relatively well compared to the decision tree, with an accuracy of 82% versus 67% and an AUC of 0.66 versus 0.57. Based on this model, we estimated the coefficients using the data analysis tool R and found that a specific variable (continuous maximum discharge current time) had a statistically significant effect on the A/S quality failure rate, and that 82% of the failure rate could be predicted. Conclusion: As the first case of applying big data analysis to quality issues in the defense industry, this study confirms that the market failure rates of defense products can be improved by focusing on the measured values of the main causes of failure derived through the big data analysis process. It also identifies points to be addressed in subsequent studies for a more reliable analysis model, such as the number of data samples and data collection limitations.

Fuzzy Reliability Analysis Models for Maintenance of Bridge Structure Systems (교량구조시스템의 유지관리를 위한 퍼지 신뢰성해석 모델)

  • 김종길;손용우;이증빈;이채규;안영기
    • Proceedings of the Computational Structural Engineering Institute Conference / 2003.10a / pp.103-114 / 2003
  • This paper proposes a method that helps maintenance engineers evaluate the damage states of bridge structure systems using Fuzzy Fault Tree Analysis. Fuzzy Fault Tree Analysis can be very useful for systematic and rational fuzzy reliability assessment of real bridge structure systems because it deals effectively with all related structural element damage in terms of linguistic variables that systematically incorporate experts' experience and subjective judgement. This paper accounts for these uncertainties by providing a fuzzy reliability-based framework and shows that identifying the optimum maintenance scenario is a straightforward process. This is achieved using the computer program LIFETIME, which can consider the effects of various types of actions on the fuzzy reliability index profile of deteriorating structures. Only the effect of maintenance interventions is considered in this study. However, any environmental or mechanical action affecting the fuzzy reliability index profile can be considered in LIFETIME. Numerical examples of deteriorating bridges are presented to illustrate the capability of the proposed approach. Further development and implementation of this approach are recommended for future research.
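To make the idea of a fuzzy fault tree concrete, the sketch below propagates triangular fuzzy failure possibilities through AND and OR gates. It uses one common approximation (component-wise gate arithmetic on the triple low, mode, high) and is not the formulation used in the paper or in LIFETIME, which the abstract does not give. The event names and values are hypothetical.

```python
import math


def fuzzy_and(events):
    """AND gate: component-wise product of triangular numbers (low, mode, high)."""
    return tuple(math.prod(component) for component in zip(*events))


def fuzzy_or(events):
    """OR gate: component-wise 1 - prod(1 - p) of triangular numbers."""
    return tuple(
        1.0 - math.prod(1.0 - value for value in component)
        for component in zip(*events)
    )


# Hypothetical basic events expressed as (low, mode, high) failure possibilities,
# e.g. obtained by mapping linguistic ratings from experts to fuzzy numbers.
girder_crack = (0.1, 0.2, 0.3)
bearing_damage = (0.2, 0.3, 0.4)
deck_deterioration = (0.05, 0.1, 0.2)

# Top event: (girder crack AND bearing damage) OR deck deterioration.
intermediate = fuzzy_and([girder_crack, bearing_damage])
top_event = fuzzy_or([intermediate, deck_deterioration])
print(intermediate)
print(top_event)
```

Because both gate functions are increasing in every input, the ordering low ≤ mode ≤ high is preserved at the top event.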

Exploring Machine Learning Classifiers for Breast Cancer Classification

  • Inayatul Haq;Tehseen Mazhar;Hinna Hafeez;Najib Ullah;Fatma Mallek;Habib Hamam
    • KSII Transactions on Internet and Information Systems (TIIS) / v.18 no.4 / pp.860-880 / 2024
  • Breast cancer is a major health concern affecting women and men globally. Early detection and accurate classification of breast cancer are vital for effective treatment and patient survival. This study addresses the challenge of accurately classifying breast tumors using machine learning classifiers such as MLP, AdaBoostM1, LogitBoost, Bayes Net, and the J48 decision tree. The research uses a dataset publicly available on GitHub to assess the classifiers' performance in differentiating between the occurrence and non-occurrence of breast cancer. The study compares the effectiveness of 10-fold and 5-fold cross-validation, showing that 10-fold cross-validation provides superior results. It also examines the impact of varying split percentages, with a 66% split yielding the best performance. This shows the importance of selecting appropriate validation techniques for machine learning-based breast tumor classification. The results also indicate that the J48 decision tree is the most accurate classifier, providing valuable insights for developing predictive models for cancer diagnosis and advancing computational medical research.
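The two validation schemes compared in this abstract, k-fold cross-validation and a percentage split, differ only in how sample indices are partitioned. The following is a minimal sketch of both partitions; it omits shuffling and stratification, which real toolkits usually apply, and the function names are illustrative rather than taken from any library.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def percentage_split(n_samples, train_fraction=0.66):
    """Return (train, test) index lists for a single hold-out split."""
    cut = round(n_samples * train_fraction)
    return list(range(cut)), list(range(cut, n_samples))


folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print(test_idx)

split_train, split_test = percentage_split(100)
print(len(split_train), len(split_test))
```

In k-fold cross-validation every sample is used for testing exactly once, whereas a percentage split evaluates on a single held-out portion.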

Development of a Detection Model for the Companies Designated as Administrative Issue in KOSDAQ Market (KOSDAQ 시장의 관리종목 지정 탐지 모형 개발)

  • Shin, Dong-In;Kwahk, Kee-Young
    • Journal of Intelligence and Information Systems / v.24 no.3 / pp.157-176 / 2018
  • The purpose of this research is to develop a detection model for companies designated as administrative issues in the KOSDAQ market using financial data. An administrative issue designation marks companies with a high potential for delisting and gives them time to resolve the reasons for delisting under certain restrictions of the Korean stock market. It acts as an alarm that informs investors and market participants which companies are likely to be delisted and warns them to invest safely. Despite this importance, there are relatively few studies on administrative issue prediction models compared with the many studies on bankruptcy prediction models. Therefore, this study develops and verifies a detection model for companies designated as administrative issues using financial data of KOSDAQ companies. Logistic regression and decision tree are proposed as the data mining models for detecting administrative issues. According to the results of the analysis, the logistic regression model predicted the designated companies using three variables, ROE (Earnings before tax), Cash flows/Shareholder's equity, and Asset turnover ratio, and its overall accuracy was 86% on the validation dataset. The decision tree (Classification and Regression Trees, CART) model applied classification rules using Cash flows/Total assets and ROA (Net income), and its overall accuracy reached 87%. The implications of the financial indicators selected in our logistic regression and decision tree models are as follows. First, ROE (Earnings before tax) in the logistic detection model shows the profit and loss of the business segments that will continue, excluding the revenue and expenses of discontinued businesses. Therefore, a weakening of this variable means that the competitiveness of the core business has weakened.
If a large part of the profits is generated from one-off gains, the deterioration of business management is very likely to intensify further. As the ROE of a KOSDAQ company decreases significantly, the company becomes highly likely to be delisted. Second, cash flows to shareholder's equity represents the firm's ability to generate cash flow when the financial condition of its subsidiaries is excluded. In other words, a weakening of the parent company's management capacity, excluding the subsidiaries' competence, can be a main reason for an increased possibility of administrative issue designation. Third, a low asset turnover ratio means that current and non-current assets are used ineffectively by the corporation, or that its asset investment is excessive. If the asset turnover ratio of a KOSDAQ-listed company decreases, its corporate activities need to be examined in detail from various perspectives, such as weakening sales or changes in inventories. Cash flows/Total assets, a variable selected by the decision tree detection model, is a key indicator of the company's cash condition and its ability to generate cash from operating activities. Cash flow indicates whether a firm can perform its main activities (maintaining its operating ability, repaying debts, paying dividends, and making new investments) without relying on external financial resources. Therefore, a negative (-) value of this variable indicates that the company may have serious problems in its business activities. If the cash flow from operating activities of a company is smaller than its net profit, the net profit has not been converted into cash, which indicates a serious problem in managing the company's trade receivables and inventory assets.
Therefore, it can be understood that as cash flows/total assets decreases, the probability of administrative issue designation and the probability of delisting increase. In summary, the logistic regression-based detection model in this study was found to be affected by the company's financial activities, including ROE (Earnings before tax), whereas the decision tree-based detection model predicts the designation based on the company's cash flows.
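A CART-style tree builds rules such as "Cash flows/Total assets ≤ threshold" by searching for the split that minimizes the weighted Gini impurity of the two resulting groups. The sketch below shows that search for a single numeric variable. It is a simplified illustration with made-up data, not the fitted model from the study, and the variable names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best rule `value <= threshold`."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical values
        # Candidate threshold: midpoint between consecutive distinct values.
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical toy data: cash flows / total assets, and whether the company
# was designated as an administrative issue (1) or not (0).
cash_flow_to_assets = [-0.25, -0.15, -0.05, 0.05, 0.10, 0.20]
designated = [1, 1, 1, 0, 0, 0]

threshold, impurity = best_split(cash_flow_to_assets, designated)
print(threshold, impurity)
```

A full CART implementation repeats this search over every candidate variable and recurses on each resulting group until a stopping rule is met.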

A study on algal bloom forecast system based on hydro-meteorological factors in the mainstream of Nakdong river using machine learning (머신러닝를 이용한 낙동강 본류 구간 수문-기상인자 조류 예보체계 연구)

  • Taewoo Lee;Soojun Kim;Junhyeong Lee;Kyunghun Kim;Hoyong Lee;Duckgil Kim
    • Journal of Wetlands Research / v.26 no.3 / pp.245-253 / 2024
  • Blue-green algal blooms, or harmful algal blooms, negatively affect aquatic ecosystems and purified water supply systems through oxygen depletion in the water body, odor, and the secretion of toxic substances in freshwater ecosystems. Blue-green algal blooms are expected to increase in intensity and frequency because of the longer residence time of algae in the water body after the construction of the Nakdong River weirs, as well as rising surface temperatures due to climate change. In this study, to respond to the expected increase in algal blooms, an algal bloom forecast system based on hydro-meteorological factors is presented for preemptive response before an algal bloom warning is issued. Through polyserial correlation analysis, the preceding influence periods of temperature and discharge were derived for each algal bloom forecast level. Using decision tree classification, a machine learning technique, classification models for the algal bloom forecast levels were derived from the temperature and discharge of the preceding period. An algal bloom forecast system based on hydro-meteorological factors was then derived from the results of the decision tree classification models. The proposed forecast system can be used as basic research for preemptive response before blue-green algal blooms.
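Features built from a preceding period, such as the temperature and discharge inputs described in this abstract, can be sketched as a trailing-window aggregate. The example below assumes the feature is a simple mean over the previous `window` observations, which the abstract does not specify; the function name and the temperature values are hypothetical.

```python
def preceding_mean(series, window):
    """Mean of the `window` values strictly before each position.

    Returns None where fewer than `window` earlier values exist.
    """
    result = []
    for i in range(len(series)):
        if i < window:
            result.append(None)  # not enough history yet
        else:
            past = series[i - window:i]
            result.append(sum(past) / window)
    return result


# Hypothetical daily water temperatures (degrees Celsius).
temperature = [20.0, 22.0, 24.0, 26.0, 28.0]
temperature_prev2 = preceding_mean(temperature, window=2)
print(temperature_prev2)
```

Using only values strictly before each day keeps the feature available at forecast time, which matters when the goal is a warning issued in advance.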