• Title/Summary/Keyword: Tree models

Committee Learning Classifier based on Attribute Value Frequency (속성 값 빈도 기반의 전문가 다수결 분류기)

  • Lee, Chang-Hwan;Jung, In-Chul;Kwon, Young-S.
    • Journal of KIISE:Databases / v.37 no.4 / pp.177-184 / 2010
  • These days, many kinds of data, including sensor, delivery, credit, and stock data, are generated continuously and in massive quantities. Learning from such data is difficult because they are large in volume and their underlying concepts change quickly. To handle these problems, learning methods based on sliding windows over time have been used. However, these approaches must rebuild the model every time new data arrive, which costs considerable time and resources. Very simple incremental learning methods are therefore needed. The Bayesian method is one example, but it has the disadvantage that it requires prior knowledge (probabilities) of the data. In this study, we propose a learning method based on attribute values that can be applied even when the prior knowledge (probability) of the data is unknown. The main concept is that each attribute value is regarded as an expert learner, and summing up the expert learners leads to better results. Experimental results show that our method learns from data very quickly and performs well compared with current learning methods (decision tree and Bayesian).
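
The "each attribute value is an expert" idea in the abstract above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the class name `AttributeValueVoter`, the normalized-frequency vote, and the toy weather stream are all hypothetical. It shows why such a learner is incremental: a new example only bumps a few counters, and no model is rebuilt.

```python
from collections import defaultdict


class AttributeValueVoter:
    """Hypothetical incremental classifier: each (attribute, value) pair
    acts as an 'expert' that votes with the class frequencies it has seen."""

    def __init__(self):
        # counts[(attribute_index, value)][label] = times seen together
        self.counts = defaultdict(lambda: defaultdict(int))
        self.labels = set()

    def update(self, row, label):
        """Absorb one labelled example by incrementing counters only."""
        self.labels.add(label)
        for i, value in enumerate(row):
            self.counts[(i, value)][label] += 1

    def predict(self, row):
        """Sum the normalized votes of every attribute-value expert."""
        scores = {label: 0.0 for label in self.labels}
        for i, value in enumerate(row):
            votes = self.counts.get((i, value))
            if not votes:
                continue  # unseen value: this expert abstains
            total = sum(votes.values())
            for label, n in votes.items():
                scores[label] += n / total
        return max(scores, key=scores.get)


# Toy data stream (made up for illustration): (outlook, temperature) -> label
model = AttributeValueVoter()
stream = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
]
for row, label in stream:
    model.update(row, label)

prediction = model.predict(("rainy", "mild"))
print(prediction)
```

Unlike a naive Bayes classifier, this sketch uses no class prior; it only aggregates per-value frequencies, which is one plausible reading of the abstract's claim that prior probabilities are not needed.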

Affected Model of Indoor Radon Concentrations Based on Lifestyle, Greenery Ratio, and Radon Levels in Groundwater (생활 습관, 주거지 주변 녹지 비율 및 지하수 내 라돈 농도 따른 실내 라돈 농도 영향 모델)

  • Lee, Hyun Young;Park, Ji Hyun;Lee, Cheol-Min;Kang, Dae Ryong
    • Journal of health informatics and statistics / v.42 no.4 / pp.309-316 / 2017
  • Objectives: Radon and its progeny pose environmental risks as carcinogens, especially to the lungs. Investigating the factors that affect indoor radon concentrations, and building models of them, is needed to prevent exposure to radon and to reduce indoor radon concentrations. The purpose of this study was to identify factors affecting indoor radon concentration and to construct a comprehensive model thereof. Methods: Questionnaires were administered to obtain data on residential environments, including building materials and lifestyle. Decision tree and structural equation modeling were applied to predict residences at risk for higher radon concentrations and to develop the comprehensive model. Results: Greenery ratio, impermeable layer ratio, residence at ground level, daily ventilation, long-term heating, cracks around the measuring device, and bedroom were shown to be significant predictive factors of higher indoor radon concentrations. Daily ventilation reduced the probability of homes having indoor radon concentrations $\geq 200\ Bq/m^3$ by 11.6%. Meanwhile, a greenery ratio $\geq 65\%$ without daily ventilation increased this probability by 15.3% compared to daily ventilation. The constructed model indicated that greenery ratio and ventilation rate directly affect indoor radon concentrations. Conclusions: Our model highlights the combined influences of geographical properties, groundwater, and lifestyle factors of an individual resident on indoor radon concentrations in Korea.

Visual Classification of Wood Knots Using k-Nearest Neighbor and Convolutional Neural Network (k-Nearest Neighbor와 Convolutional Neural Network에 의한 제재목 표면 옹이 종류의 화상 분류)

  • Kim, Hyunbin;Kim, Mingyu;Park, Yonggun;Yang, Sang-Yun;Chung, Hyunwoo;Kwon, Ohkyung;Yeo, Hwanmyeong
    • Journal of the Korean Wood Science and Technology / v.47 no.2 / pp.229-238 / 2019
  • Various wood defects occur during tree growth or wood processing. To use wood in practice, it is therefore necessary to assess its quality objectively against the usage requirements by accurately classifying its defects. However, manual visual grading and species classification may produce inconsistent results because of subjective decisions; computer-vision-based image analysis is therefore required for the objective evaluation of wood quality and for speeding up wood production. In this study, SIFT+k-NN and CNN models were used to implement a model that automatically classifies knots, and their accuracy was analyzed. To this end, a total of 1,172 knot images in various shapes from five domestic conifers were used for learning and validation. For the SIFT+k-NN model, SIFT was used to extract features from the knot images and k-NN was used for classification, giving an accuracy of up to 60.53% when k was 17. The CNN model comprised 8 convolution layers and 3 hidden layers, and its maximum accuracy was 88.09% after 1,205 epochs, which was higher than that of the SIFT+k-NN model. Moreover, when the number of images differed greatly between knot types, the SIFT+k-NN model tended to be biased toward the knot type with more images, whereas the CNN model did not show a drastic bias regardless of the difference in the number of images. The CNN model therefore showed better performance in knot classification, and wood knot classification by the CNN model is judged to be sufficiently accurate for practical application.

Predicting Corporate Bankruptcy using Simulated Annealing-based Random Forest (시뮬레이티드 어니일링 기반의 랜덤 포레스트를 이용한 기업부도예측)

  • Park, Hoyeon;Kim, Kyoung-jae
    • Journal of Intelligence and Information Systems / v.24 no.4 / pp.155-170 / 2018
  • Predicting a company's financial bankruptcy is traditionally one of the most crucial forecasting problems in business analytics. In previous studies, prediction models have been proposed by applying or combining statistical and machine learning-based techniques. In this paper, we propose a novel intelligent prediction model based on simulated annealing, one of the well-known optimization techniques. Simulated annealing is known to have optimization performance comparable to that of genetic algorithms. Nevertheless, since there has been little research on the prediction and classification of business decision-making problems using simulated annealing, it is meaningful to confirm the usefulness of the proposed model in business analytics. In this study, we use a combined model of simulated annealing and machine learning to select the input features of the bankruptcy prediction model. Typical ways of combining optimization and machine learning techniques are feature selection, feature weighting, and instance selection. This study proposes a combined model for feature selection, which is the most studied of the three. To confirm the superiority of the proposed model, we apply it to real-world financial data of Korean companies and analyze the results. The results show that the predictive accuracy of the proposed model is better than that of the naïve model. Notably, the performance is significantly improved compared with the traditional decision tree, random forests, artificial neural network, SVM, and logistic regression analysis.
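
Feature selection by simulated annealing, as described in the abstract above, can be sketched along these lines. This is a generic illustration rather than the paper's implementation: the function name `anneal_feature_subset`, the geometric cooling schedule, the single-bit-flip neighborhood, and `toy_score` (a stand-in for the validation accuracy of a classifier trained on the selected features) are all assumptions made for the example.

```python
import math
import random


def anneal_feature_subset(n_features, score, steps=2000,
                          t_start=1.0, t_end=0.01, seed=0):
    """Search for a boolean feature mask that maximizes score(mask)."""
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(n_features)]
    current_score = score(current)
    best, best_score = current[:], current_score

    for step in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        frac = step / max(steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** frac

        # Neighbor: flip one randomly chosen feature in or out of the subset.
        candidate = current[:]
        j = rng.randrange(n_features)
        candidate[j] = not candidate[j]
        candidate_score = score(candidate)

        # Always accept improvements; accept worse moves with
        # probability exp(delta / temperature), where delta < 0.
        delta = candidate_score - current_score
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current[:], current_score

    return best, best_score


# Hypothetical scorer: features 0, 2 and 5 are informative,
# and every extra feature carries a small penalty.
USEFUL = {0, 2, 5}


def toy_score(mask):
    hits = sum(1 for i, on in enumerate(mask) if on and i in USEFUL)
    extras = sum(1 for i, on in enumerate(mask) if on and i not in USEFUL)
    return hits - 0.1 * extras


best_mask, best_value = anneal_feature_subset(8, toy_score)
selected = [i for i, on in enumerate(best_mask) if on]
print(selected, best_value)
```

In a real setting `score` would train and validate the downstream model (for example a random forest) on the masked feature set, which makes each step far more expensive than in this toy.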

Development of prediction model identifying high-risk older persons in need of long-term care (장기요양 필요 발생의 고위험 대상자 발굴을 위한 예측모형 개발)

  • Song, Mi Kyung;Park, Yeongwoo;Han, Eun-Jeong
    • The Korean Journal of Applied Statistics / v.35 no.4 / pp.457-468 / 2022
  • In an aged society, it is important to prevent older people from developing disabilities that require long-term care. The purpose of this study is to develop a prediction model to discover high-risk groups who are likely to become beneficiaries of Long-Term Care Insurance. This is a retrospective study using the National Health Insurance Service (NHIS) database of records previously collected on the study subjects. The study subjects are 7,724,101 people, the population over 65 years of age registered for medical insurance. To develop the prediction model, we used logistic regression, decision tree, random forest, and multi-layer perceptron neural network. Finally, random forest was selected as the prediction model based on the performance of the models obtained through internal and external validation. Random forest could predict about 90% of the older people in need of long-term care using the database alone, without any information from the assessment of eligibility for long-term care. The findings might be useful in evidence-based health management for prevention services and can contribute to preemptively identifying older people who need preventive services.

The Detection of Online Manipulated Reviews Using Machine Learning and GPT-3 (기계학습과 GPT3를 시용한 조작된 리뷰의 탐지)

  • Chernyaeva, Olga;Hong, Taeho
    • Journal of Intelligence and Information Systems / v.28 no.4 / pp.347-364 / 2022
  • Fraudulent companies or sellers strategically manipulate reviews to influence customers' purchase decisions; therefore, the reliability of reviews has become crucial for customer decision-making. Since customers increasingly rely on online reviews to search for more detailed information about products or services before purchasing, many researchers focus on detecting manipulated reviews. However, the main problem in detecting manipulated reviews is the difficulty of obtaining enough manipulated reviews to train machine learning techniques. In addition, manipulated reviews are far fewer than non-manipulated reviews, so a class imbalance problem occurs. The class with fewer examples is under-represented and can hamper a model's accuracy, so solving the class imbalance problem is important for building an accurate model for detecting manipulated reviews. We therefore propose an OpenAI-based review generation model to address the imbalance of manipulated reviews and thereby enhance the accuracy of manipulated review detection. In this research, we applied the autoregressive language model GPT-3 to generate reviews based on manipulated reviews. Moreover, we found that applying the GPT-3 model to oversample manipulated reviews can recover a satisfactory portion of the performance losses and shows better classification performance (logit, decision tree, neural networks) than traditional oversampling methods such as random oversampling and SMOTE.
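
For context, the random oversampling baseline named in the abstract above can be sketched as below. This is the generic textbook technique, not the paper's GPT-3 pipeline; the function name `random_oversample` and the toy reviews are hypothetical. It duplicates minority-class examples at random until every class is as large as the biggest one.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class examples at random until all classes
    match the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, n in counts.items():
        # Pool of original examples belonging to this class.
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels


# Toy imbalanced data (made up): four genuine reviews, one manipulated.
reviews = ["great", "fine", "ok", "nice", "BUY NOW!!!"]
labels = ["genuine"] * 4 + ["manipulated"]

balanced_reviews, balanced_labels = random_oversample(reviews, labels)
balanced_counts = Counter(balanced_labels)
print(balanced_counts)
```

Because the added examples are exact copies, this baseline adds no new information about the minority class, which is the gap that generated reviews are meant to fill.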

Water Level Prediction on the Golok River Utilizing Machine Learning Technique to Evaluate Flood Situations

  • Pheeranat Dornpunya;Watanasak Supaking;Hanisah Musor;Oom Thaisawasdi;Wasukree Sae-tia;Theethut Khwankeerati;Watcharaporn Soyjumpa
    • Proceedings of the Korea Water Resources Association Conference / 2023.05a / pp.31-31 / 2023
  • During December 2022, the northeast monsoon, which dominates the south and the Gulf of Thailand, brought significant rainfall that impacted the lower southern region, causing flash floods, landslides, blustery winds, and rivers overflowing their banks. The Golok River, located in Narathiwat, which forms the border between Thailand and Malaysia, was also affected by the rainfall. In flood management, instruments for measuring precipitation and water level have become important for assessing and forecasting the trend of situations and areas of risk. However, because the region lies on an international border, the installed telemetry system cannot measure the rainfall and water level of the entire area. This study aims to predict 72 hours of water level and evaluate the situation as information to support the government in making water management decisions, publicizing them to relevant agencies, and warning citizens during crisis events. This research applies machine learning (ML) to water level prediction for the Golok River at the Lan Tu Bridge area, Sungai Golok Subdistrict, Su-ngai Golok District, Narathiwat Province, which is one of the major monitored rivers. The eXtreme Gradient Boosting (XGBoost) algorithm, a tree-based ensemble machine learning algorithm, was used to predict hourly water levels in the R programming language. Model training and testing were carried out using observed hourly rainfall from the STH010 station and hourly water level data from the X.119A station between 2020 and 2022 as the main prediction inputs. Furthermore, the model takes hourly spatial rainfall forecasting data from the Weather Research and Forecasting and Regional Ocean Model System models (WRF-ROMs) provided by the Hydro-Informatics Institute (HII) as input, allowing it to predict the hourly water level in the Golok River. The evaluation of predictive performance using statistical metrics yielded an R-square of 0.96, which supports the robustness of the forecasts. The result shows that the predicted water level at the X.119A telemetry station (Golok River) is in steady decline, consistent with the decrease in the 72-hour rainfall predicted by WRF-ROMs. In short, the relationship between input and result can be used to evaluate flood situations. The data are contributed as operational support to the Special Water Resources Management Operation Center in Southern Thailand for flood preparedness and response, to support informed decisions on water management during crises and to prepare for and prevent loss and harm to citizens.

Performance Comparison of Machine Learning based Prediction Models for University Students Dropout (머신러닝 기반 대학생 중도 탈락 예측 모델의 성능 비교)

  • Seok-Bong Jeong;Du-Yon Kim
    • Journal of the Korea Society for Simulation / v.32 no.4 / pp.19-26 / 2023
  • The increase in the dropout rate of college students nationwide has a serious negative impact on universities and society as well as on individual students. To proactively identify students at risk of dropout, this study built decision tree, random forest, logistic regression, and deep learning-based dropout prediction models using academic data that can be easily obtained from each university's academic management system. Their performances were subsequently analyzed and compared. The analysis revealed that while the logistic regression-based prediction model exhibited the highest recall, its F1 score and ROC-AUC (Receiver Operating Characteristic - Area Under the Curve) value were comparatively lower. On the other hand, the random forest-based prediction model demonstrated superior performance across all metrics except recall. In addition, to assess model performance over distinct prediction periods, we divided these periods into short-term (within one semester), medium-term (within two semesters), and long-term (within three semesters). The results showed that long-term prediction yielded the highest predictive efficacy. Through this study, each university is expected to be able to identify students at risk of dropping out early, reduce the dropout rate through intensive management, and further contribute to the stabilization of university finances.
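
The abstract's observation that a model can have the highest recall yet a lower F1 score follows directly from the metric definitions. The sketch below uses the standard formulas; the function name `classification_metrics` and the two prediction vectors are made up for illustration and are not data from the study.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one positive class, from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical cohort: 1 = dropped out, 0 = stayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# An "eager" model flags many students: it catches every dropout
# but also raises four false alarms.
eager_pred = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# A more conservative model misses one dropout but raises no false alarms.
conservative_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

eager = classification_metrics(y_true, eager_pred)
conservative = classification_metrics(y_true, conservative_pred)
print(eager)
print(conservative)
```

Here the eager model reaches a recall of 1.0 but a precision of 0.5, so its F1 is about 0.67, while the conservative model's recall of 0.75 with perfect precision gives an F1 of about 0.86.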

Spatial Distribution Patterns and Prediction of Hotspot Area for Endangered Herpetofauna Species in Korea (국내 멸종위기양서·파충류의 공간적 분포형태와 주요 분포지역 예측에 대한 연구)

  • Do, Min Seock;Lee, Jin-Won;Jang, Hoan-Jin;Kim, Dae-In;Park, Jinwoo;Yoo, Jeong-Chil
    • Korean Journal of Environment and Ecology / v.31 no.4 / pp.381-396 / 2017
  • Understanding species distribution plays an important role in conservation as well as evolutionary biology. In this study, we applied a species distribution model to predict hotspot areas and habitat characteristics for endangered herpetofauna species in South Korea: the Korean Crevice Salamander (Karsenia koreana), Suweon tree frog (Hyla suweonensis), Gold-spotted pond frog (Pelophylax chosenicus), Narrow-mouthed toad (Kaloula borealis), Korean ratsnake (Elaphe schrenckii), Mongolian racerunner (Eremias argus), Reeve's turtle (Mauremys reevesii) and Soft-shelled turtle (Pelodiscus sinensis). The Kori salamander (Hynobius yangi) and Black-headed snake (Sibynophis chinensis) were excluded from the analysis due to insufficient sample size. The results showed that altitude was the most important environmental variable for their distribution, and the altitude at which these species were distributed correlated with the climate of that region. The predicted distribution area derived from the species distribution modelling adequately reflected the observation sites used in this study as well as those reported in preceding studies. The average AUC value of the eight species was relatively high ($0.845 \pm 0.08$), while the average omission rate was relatively low ($0.087 \pm 0.01$). Therefore, the species overlaying model created for the endangered species is considered successful. When the distribution models were merged, five species were shown to share habitats in the coastal areas of Gyeonggi-do and Chungcheongnam-do, in the western part of the Korean Peninsula. We therefore suggest that protection should be a high priority in these areas, and our overall results may serve as essential and fundamental data for the conservation of endangered amphibians and reptiles in Korea.

Development of A Two-Variable Spatial Leaf Photosynthetic Model of Irwin Mango Grown in Greenhouse (온실재배 어윈 망고의 위치 별 2변수 엽 광합성 모델 개발)

  • Jung, Dae Ho;Shin, Jong Hwa;Cho, Young Yeol;Son, Jung Eek
    • Journal of Bio-Environment Control / v.24 no.3 / pp.161-166 / 2015
  • To determine adequate levels of light intensity and $CO_2$ concentration for mango grown in greenhouses, quantitative measurements of photosynthetic rates at various leaf positions in the tree are required. The objective of this study was to develop two-variable leaf photosynthetic models of Irwin mango (Mangifera indica L. cv. Irwin) using light intensity and $CO_2$ concentration at different leaf positions. Leaf photosynthetic rates at different positions (top, middle, and bottom) were measured by a leaf photosynthesis analyzer at light intensities of 0, 50, 100, 200, 300, 400, 600, and $800\ \mu mol \cdot m^{-2} \cdot s^{-1}$ with $CO_2$ concentrations of 100, 400, 800, 1200, and $1600\ \mu mol \cdot mol^{-1}$. The two-variable model consisted of two leaf photosynthetic models expressed as negative exponential functions of light intensity and $CO_2$ concentration, respectively. The photosynthetic rates of top leaves were saturated at a light intensity of $400\ \mu mol \cdot m^{-2} \cdot s^{-1}$, while those of middle and bottom leaves were saturated at $200\ \mu mol \cdot m^{-2} \cdot s^{-1}$. The leaf photosynthetic rates did not reach the saturation point at a $CO_2$ concentration of $1600\ \mu mol \cdot mol^{-1}$. In validation of the model, the estimated photosynthetic rates of top and bottom leaves agreed better with the measured values than those of middle leaves. The two-variable model is expected to help determine the light intensity and $CO_2$ concentration that maximize the photosynthetic rates of Irwin mango grown in greenhouses.