• Title/Summary/Keyword: random forest (랜덤 포리스트)

Search Result 9, Processing Time 0.027 seconds

Medical Image Retrieval using Bag-of-Feature and Random Forest Classifier (Bag-of-Feature 특징과 랜덤 포리스트를 이용한 의료영상 검색 기법)

  • Son, JungEun;Kwak, JunYoung;Ko, ByoungChul;Nam, JaeYeal
    • Proceedings of the Korea Information Processing Society Conference / 2012.11a / pp.601-603 / 2012
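  • In this paper, we develop the Oriented Center Symmetric Local Binary Patterns (OCS-LBP) feature, which uses image gradient orientation values as features to reflect the characteristics of medical images, and apply Bag-of-Feature (BoF) to the extracted feature values to reduce their dimensionality and regenerate them as meaningful feature units. For retrieval, unlike conventional image retrieval methods, we propose a new method in which a random forest is trained in advance on training images and used to classify the database images automatically into N classes. The query image is passed through the random forest in the same way, and similar images are retrieved with the K-nearest neighbor method only from the 2 classes with the highest probability values. In the experiments, we compared the retrieval performance and time of a general similarity measurement method with those of the random forest-based method, and showed that the proposed method performs considerably better in both retrieval performance and time.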
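
The retrieval step described above (search only the 2 most probable classes with K-nearest neighbor) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `retrieve`, the toy database, the class names, and the use of Euclidean distance are all assumptions, and the class probabilities are taken as given rather than produced by a trained random forest.

```python
import math

def retrieve(query_vec, class_probs, database, k=3, top_classes=2):
    """Return the ids of the k nearest database items among the most probable classes.

    database is a list of (image_id, class_label, feature_vector) tuples
    (a hypothetical layout chosen for this sketch).
    """
    # Keep only the classes with the highest predicted probability.
    best = sorted(class_probs, key=class_probs.get, reverse=True)[:top_classes]
    candidates = [item for item in database if item[1] in best]
    # Rank the remaining candidates by Euclidean distance to the query.
    candidates.sort(key=lambda item: math.dist(query_vec, item[2]))
    return [item[0] for item in candidates[:k]]

# Toy data: feature vectors and class probabilities are invented for illustration.
database = [
    ("img_a", "chest", [0.9, 0.1]),
    ("img_b", "chest", [0.8, 0.2]),
    ("img_c", "skull", [0.2, 0.9]),
    ("img_d", "hand", [0.85, 0.15]),
]
class_probs = {"chest": 0.6, "skull": 0.3, "hand": 0.1}
result = retrieve([0.88, 0.12], class_probs, database, k=2)
```

Restricting the search to the top classes shrinks the candidate set before any distance is computed, which is one plausible reason such a scheme could save time; the trade-off is that an item in an excluded class (here `img_d`) is never returned, even when it lies close to the query.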

Comparison of data mining methods with daily lens data (데일리 렌즈 데이터를 사용한 데이터마이닝 기법 비교)

  • Seok, Kyungha;Lee, Taewoo
    • Journal of the Korean Data and Information Science Society / v.24 no.6 / pp.1341-1348 / 2013
  • Various data mining techniques have been applied to classification problems in database marketing, credit scoring and market forecasting. In this paper, we compare techniques such as bagging, boosting, LASSO, random forest and support vector machine (SVM) on daily lens transaction data. The classical techniques, decision tree and logistic regression, are also included. The experiment shows that the random forest has a slightly smaller misclassification rate and standard error than the other methods. The SVM performs well in terms of misclassification rate but poorly in terms of standard error. Taking model interpretation and computing time into consideration, we conclude that the LASSO gives the best result.
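
The two comparison criteria named in this abstract can be sketched as below. The abstract does not say how the standard error was obtained, so this sketch assumes it is the standard error of the mean misclassification rate over repeated evaluation runs; both function names are hypothetical.

```python
import statistics

def misclassification_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

def mean_and_standard_error(rates):
    """Mean rate and its standard error over repeated runs (assumed protocol)."""
    mean = statistics.mean(rates)
    # Sample standard deviation divided by the square root of the number of runs.
    se = statistics.stdev(rates) / len(rates) ** 0.5
    return mean, se

# Invented labels and rates, for illustration only.
rate = misclassification_rate([1, 0, 1, 1], [1, 1, 1, 0])
mean, se = mean_and_standard_error([0.1, 0.2, 0.3])
```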

Feature Selection for Classification of Mass Spectrometric Proteomic Data Using Random Forest (단백체 스펙트럼 데이터의 분류를 위한 랜덤 포리스트 기반 특성 선택 알고리즘)

  • Ohn, Syng-Yup;Chi, Seung-Do;Han, Mi-Young
    • Journal of the Korea Society for Simulation / v.22 no.4 / pp.139-147 / 2013
  • This paper proposes a novel feature selection method for mass spectrometric proteomic data based on Random Forest. The method includes an effective preprocessing step that filters out a large number of redundant, highly correlated features, and applies a tournament strategy to obtain an optimal feature subset. Experiments on three public datasets, Ovarian 4-3-02, Ovarian 7-8-02 and Prostate, show that the new method achieves high performance compared with widely used methods and a balanced rate of specificity and sensitivity.

Prediction of the Movement Directions of Index and Stock Prices Using Extreme Gradient Boosting (익스트림 그라디언트 부스팅을 이용한 지수/주가 이동 방향 예측)

  • Kim, HyoungDo
    • The Journal of the Korea Contents Association / v.18 no.9 / pp.623-632 / 2018
  • Both investors and researchers pay close attention to predicting the direction of stock price movements, since accurate prediction plays an important role in strategic decision making on stock trading. Taken together, previous studies show that different factors are considered depending on the stock market and the prediction period. This paper aims to analyze which data mining techniques perform better on some representative index and stock price datasets in the Korea stock market. In particular, the extreme gradient boosting technique, which has proven itself a front-runner in recent open competitions, is applied to the prediction problem. Its performance is compared with that of other data mining techniques reported to perform well in predicting stock price movement directions, such as random forests, support vector machines, and artificial neural networks. Experiments with 12 years of index/price datasets show that the gradient boosting technique is the best at predicting movement directions 1 to 4 days ahead, with partial equivalence to the other techniques in a few cases.
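
Predicting the movement direction 1 to 4 days ahead needs a binary target for each horizon. The sketch below shows one way to build such labels from a price series. The labeling rule (1 when the price is strictly higher after `horizon` days, otherwise 0) and the function name are assumptions, since the abstract does not define the target.

```python
def direction_labels(prices, horizon=1):
    """Label each day 1 if the price `horizon` days later is higher, else 0.

    The last `horizon` days have no future price and therefore get no label.
    """
    return [
        1 if prices[t + horizon] > prices[t] else 0
        for t in range(len(prices) - horizon)
    ]

# Invented closing prices, for illustration only.
prices = [100, 102, 101, 105, 104]
one_day = direction_labels(prices, horizon=1)
two_day = direction_labels(prices, horizon=2)
```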

A Survival Prediction Model of Rats in Uncontrolled Acute Hemorrhagic Shock Using the Random Forest Classifier (랜덤 포리스트를 이용한 비제어 급성 출혈성 쇼크의 흰쥐에서의 생존 예측)

  • Choi, J.Y.;Kim, S.K.;Koo, J.M.;Kim, D.W.
    • Journal of Biomedical Engineering Research / v.33 no.3 / pp.148-154 / 2012
  • Hemorrhagic shock is a primary cause of death resulting from injury worldwide. Although many studies have tried to accurately diagnose hemorrhagic shock at an early stage, such attempts have not been successful due to compensatory mechanisms in humans. The objective of this study was to construct a survival prediction model of rats in acute hemorrhagic shock using a random forest (RF) model. Heart rate (HR), mean arterial pressure (MAP), respiration rate (RR), lactate concentration (LC), and peripheral perfusion (PP) measured in rats were used as input variables for the RF model, and its performance was compared with that of a logistic regression (LR) model. Before constructing the models, we performed 5-fold cross validation for RF variable selection and forward stepwise variable selection for the LR model to examine which variables were important for the models. For the LR model, sensitivity, specificity, accuracy, and area under the receiver operating characteristic curve (ROC-AUC) were 0.83, 0.95, 0.88, and 0.96, respectively. For the RF model, sensitivity, specificity, accuracy, and AUC were 0.97, 0.95, 0.96, and 0.99, respectively. In conclusion, the RF model was superior to the LR model for survival prediction in the rat model.
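
For reference, the sensitivity, specificity, and accuracy reported above follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the counts are invented for illustration and are not taken from the paper.

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positives among all actual positives
    specificity = tn / (tn + fp)   # true negatives among all actual negatives
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Hypothetical counts: 10 actual positives and 10 actual negatives.
sens, spec, acc = classification_metrics(tp=8, fn=2, tn=9, fp=1)
```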

A Comparison of Ensemble Methods Combining Resampling Techniques for Class Imbalanced Data (데이터 전처리와 앙상블 기법을 통한 불균형 데이터의 분류모형 비교 연구)

  • Lee, Hee-Jae;Lee, Sungim
    • The Korean Journal of Applied Statistics / v.27 no.3 / pp.357-371 / 2014
  • There are many studies related to imbalanced data in which the class distribution is highly skewed. To address the problem of imbalanced data, previous studies have dealt with resampling techniques that correct the skewness of the class distribution in each sampled subset by using under-sampling, over-sampling or hybrid sampling such as SMOTE. Ensemble methods have also alleviated the problem of class-imbalanced data. In this paper, we compare around a dozen algorithms that combine ensemble methods and resampling techniques, based on simulated data sets generated by the Backbone model, which can handle the imbalance rate. The results on various real imbalanced data sets are also presented to compare the effectiveness of the algorithms. As a result, we highly recommend combining resampling techniques with ensemble methods for imbalanced data in which the proportion of the minority class is less than 10%. We also find that each ensemble method has a well-matched sampling technique. Algorithms that combine bagging or random forest ensembles with random undersampling tend to perform well, whereas the boosting ensemble appears to perform better with over-sampling. All ensemble methods combined with SMOTE perform well in most situations.
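
Random undersampling, one of the resampling techniques compared above, can be sketched in a few lines. This is a generic illustration and not the paper's procedure: the function name, the fixed seed, and the choice to shrink every class to the size of the smallest one are assumptions.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Randomly drop samples so every class has as many samples as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_min = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Sample without replacement; the smallest class is kept in full.
        keep.extend(rng.sample(idx, n_min))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data with a 10% minority class (sample 9), for illustration only.
X = list(range(10))
y = [0] * 9 + [1]
X_bal, y_bal = random_undersample(X, y)
```

In an ensemble setting, a step like this would typically be repeated with a different seed for each base learner, so that different majority-class samples are seen across the ensemble.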

Application of machine learning models for estimating house price (단독주택가격 추정을 위한 기계학습 모형의 응용)

  • Lee, Chang Ro;Park, Key Ho
    • Journal of the Korean Geographical Society / v.51 no.2 / pp.219-233 / 2016
  • In social science fields, statistical models are used almost exclusively for causal explanation, and explanatory modeling has been the mainstream until now. In contrast, predictive modeling has been rare in these fields. Hence, we focus on constructing a predictive non-parametric model instead of an explanatory model. Gangnam-gu, Seoul was chosen as the study area, and we collected sales data for single-family houses sold between 2011 and 2014. We applied non-parametric models proposed in the machine learning area, including generalized additive model (GAM), random forest, multivariate adaptive regression splines (MARS) and support vector machines (SVM). Recently developed models such as MARS and SVM were found to be superior in predictive power for house price estimation. Finally, spatial autocorrelation was additionally accounted for in the non-parametric models, and the result showed that their predictive power was enhanced further. We hope that this study will prompt the methodology for property price estimation to be extended from traditional parametric models to non-parametric ones.

Network Classification of P2P Traffic with Various Classification Methods (다양한 분류기법을 이용한 네트워크상의 P2P 데이터 분류실험)

  • Han, Seokwan;Hwang, Jinsoo
    • The Korean Journal of Applied Statistics / v.28 no.1 / pp.1-8 / 2015
  • Security has become an issue due to the rapid increase in internet traffic. P2P traffic in particular poses a great challenge to network systems administrators. Preemptive measures, such as blocking suspicious traffic, are necessary for network quality of service (QoS) and efficient resource management. Deep packet inspection (DPI) is the most exact way to detect an intrusion, but it may pose a privacy problem and takes time. We used several machine learning methods and compared their performance in classifying network traffic data in terms of accuracy and time. The Random Forest method shows excellent performance in both accuracy and time.

A Comparative Evaluation of Multiple Meteorological Datasets for the Rice Yield Prediction at the County Level in South Korea (우리나라 시군단위 벼 수확량 예측을 위한 다종 기상자료의 비교평가)

  • Cho, Subin;Youn, Youjeong;Kim, Seoyeon;Jeong, Yemin;Kim, Gunah;Kang, Jonggu;Kim, Kwangjin;Cho, Jaeil;Lee, Yangwon
    • Korean Journal of Remote Sensing / v.37 no.2 / pp.337-357 / 2021
  • Because the growth of paddy rice is affected by meteorological factors, selecting appropriate meteorological variables is essential for building a rice yield prediction model. This paper examines the suitability of multiple meteorological datasets for rice yield modeling in South Korea, 1996-2019, and conducts a hindcast experiment for rice yield using a machine learning method that considers the nonlinear relationships between meteorological variables and rice yield. In addition to the ASOS in-situ observations, we used CRU-JRA ver. 2.1 and ERA5 reanalysis. From the multiple meteorological datasets, we extracted the four common variables (air temperature, relative humidity, solar radiation, and precipitation) and analyzed the characteristics of each dataset and its association with rice yields. CRU-JRA ver. 2.1 showed an overall agreement with the other datasets. While relative humidity had little relationship with rice yields, solar radiation showed a somewhat high correlation with them. Using the air temperature, solar radiation, and precipitation of July, August, and September, we built a random forest model for the hindcast experiments of rice yields. The model with CRU-JRA ver. 2.1 showed the best performance, with a correlation coefficient of 0.772. Solar radiation had the highest importance among the variables in the prediction model, which is in accordance with general agricultural knowledge. This paper has implications for selecting among multiple meteorological datasets for rice yield modeling.
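
The abstract evaluates the hindcast by a correlation coefficient between predicted and observed yields. Assuming this is the Pearson correlation coefficient (the abstract does not say which one), it can be computed as in the sketch below. The function name and the numbers are illustrative only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of products of deviations, then the root of each sum of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)

# Invented values: a perfectly linear increasing and decreasing relationship.
r_up = pearson_r([1, 2, 3], [2, 4, 6])
r_down = pearson_r([1, 2, 3], [6, 4, 2])
```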