• Title/Abstract/Keywords: Random Forest Classification

Search results: 295 items (processing time 0.02 s)

A Clustering Approach for Feature Selection in Microarray Data Classification Using Random Forest

  • Aydadenta, Husna;Adiwijaya, Adiwijaya
    • Journal of Information Processing Systems / Vol. 14, No. 5 / pp.1167-1175 / 2018
  • Microarray data play an essential role in diagnosing and detecting cancer. Microarray analysis allows gene expression levels to be examined in specific cell samples, with thousands of genes analyzed simultaneously. However, microarray data have very few samples and very high dimensionality, so classifying them requires a dimensionality reduction step. Dimensionality reduction eliminates redundant data, so that only features highly correlated with the class are used in classification. There are two types of dimensionality reduction: feature selection and feature extraction. In this paper, we use the k-means algorithm as the clustering approach for feature selection. The proposed approach groups features with similar characteristics into one cluster, so that redundancy in the microarray data is removed. The features in each cluster are ranked using the Relief algorithm, and the best-scoring feature of each cluster is selected and used in the classification process, which is carried out with the Random Forest algorithm. In simulation, the proposed approach achieved accuracies of 85.87%, 98.9%, and 89% on the Colon, Lung Cancer, and Prostate Tumor datasets, respectively, which is higher than Random Forest without clustering.
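The pipeline described in this abstract (cluster the features, rank them with Relief, keep the best feature per cluster) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy data, the deterministic farthest-point initialisation of k-means, and the function names `kmeans`, `relief`, and `select_features` are all hypothetical, and the Relief variant is the basic two-class form that visits every sample once.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=10):
    """Plain k-means. Centroids start from a farthest-point heuristic
    (an assumption made here so the sketch is deterministic)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def relief(X, y):
    """Basic two-class Relief: reward features that differ on the nearest
    miss and agree on the nearest hit (L1 distance)."""
    n = len(X)
    weights = [0.0] * len(X[0])
    for i, xi in enumerate(X):
        def l1(j):
            return sum(abs(a - b) for a, b in zip(xi, X[j]))
        hit = X[min((j for j in range(n) if j != i and y[j] == y[i]), key=l1)]
        miss = X[min((j for j in range(n) if y[j] != y[i]), key=l1)]
        for f in range(len(xi)):
            weights[f] += abs(xi[f] - miss[f]) - abs(xi[f] - hit[f])
    return [w / n for w in weights]


def select_features(X, y, k):
    """Cluster the feature columns, then keep the top Relief feature per cluster."""
    columns = [list(col) for col in zip(*X)]   # one point per feature
    clusters = kmeans(columns, k)
    weights = relief(X, y)
    chosen = []
    for c in range(k):
        members = [f for f, lab in enumerate(clusters) if lab == c]
        if members:
            chosen.append(max(members, key=lambda f: weights[f]))
    return clusters, sorted(chosen)


# Hypothetical toy data: 6 samples x 4 features on an integer 0-10 scale.
# Features 0 and 1 are near-duplicates, as are features 2 and 3.
X = [
    [1, 2, 5, 5],
    [2, 2, 1, 2],
    [1, 1, 9, 9],
    [9, 8, 5, 6],
    [8, 8, 1, 1],
    [9, 9, 9, 9],
]
y = [0, 0, 0, 1, 1, 1]

clusters, selected = select_features(X, y, k=2)
print("feature clusters:", clusters)
print("selected feature indices:", selected)
```

The selected columns would then be passed to a Random Forest classifier; that step is omitted here to keep the sketch dependency-free.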

Construction of an Internet of Things Industry Chain Classification Model Based on IRFA and Text Analysis

  • Zhimin Wang
    • Journal of Information Processing Systems / Vol. 20, No. 2 / pp.215-225 / 2024
  • With the rapid development of the Internet of Things (IoT) and big data technology, related industries generate large amounts of data during operation. Classifying these data accurately has become central to research on data mining and processing in the IoT industry chain. This study constructs a classification model for the IoT industry chain based on an improved random forest algorithm and text analysis, aiming to classify IoT industry chain big data efficiently and accurately by improving on traditional algorithms. The accuracy, precision, recall, and AUC of the traditional Random Forest algorithm and of the proposed algorithm are compared on different datasets. The experimental results show that the proposed model performs better across datasets: its accuracy and recall are better than those of the traditional algorithm on four datasets, its accuracy on two datasets, P-I Diabetes and Loan Default, is better than that of the random forest model, and its final classification results are better. The model can therefore accurately classify the massive data generated in the IoT industry chain, which supports further research on data mining and processing technology for the IoT industry chain.
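The four metrics compared in this abstract (accuracy, precision, recall, AUC) can be computed for a binary classifier as in the following sketch. The labels and scores are made-up illustration values, not data from the paper, and the helper name `evaluate` and the 0.5 decision threshold are assumptions.

```python
def evaluate(y_true, scores, threshold=0.5):
    """Return accuracy, precision, recall and AUC for binary labels (0/1)
    and real-valued scores, where a higher score means 'more likely positive'."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # AUC as the probability that a random positive outranks a random
    # negative; tied scores count as half a win.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)

    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "auc": wins / (len(pos) * len(neg)),
    }


# Hypothetical scores from some classifier on eight test samples.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
print(evaluate(y_true, scores))
```

Accuracy, precision, and recall depend on the chosen threshold, while AUC is computed from the score ranking alone, which is one reason the four numbers can disagree about which model is better.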

Supervised Learning-Based Collaborative Filtering Using Market Basket Data for the Cold-Start Problem

  • Hwang, Wook-Yeon;Jun, Chi-Hyuck
    • Industrial Engineering and Management Systems / Vol. 13, No. 4 / pp.421-431 / 2014
  • The market basket data in the form of a binary user-item matrix or a binary item-user matrix can be modelled as a binary classification problem. The binary logistic regression approach tackles the binary classification problem, where principal components are predictor variables. If users or items are sparse in the training data, the binary classification problem can be considered as a cold-start problem. The binary logistic regression approach may not function appropriately if the principal components are inefficient for the cold-start problem. Assuming that the market basket data can also be considered as a special regression problem whose response is either 0 or 1, we propose three supervised learning approaches: random forest regression, random forest classification, and elastic net to tackle the cold-start problem, comparing the performance in a variety of experimental settings. The experimental results show that the proposed supervised learning approaches outperform the conventional approaches.

랜덤포레스트를 이용한 국내 학술지 논문의 자동분류에 관한 연구 (An Analytical Study on Automatic Classification of Domestic Journal articles Using Random Forest)

  • 김판준
    • 정보관리학회지 (Journal of the Korean Society for Information Management) / Vol. 36, No. 2 / pp.57-77 / 2019
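  • Random Forest (RF), a representative ensemble technique, was applied to the automatic classification of journal articles in the field of library and information science. In particular, multifaceted experiments were conducted on key factors such as the number of trees, feature selection, and training set size, in terms of classification performance when automatically assigning subject categories to domestic journal articles. Through these experiments, we sought ways to optimize RF performance on a real-world imbalanced dataset. The results indicate that, for the automatic classification of domestic journal articles, RF can be expected to perform best with a number of trees in the interval 100 to 1000 (C), a small feature set (10%) selected by the chi-square statistic (CHI), and most of the training set (9 to 10 years).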
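The feature-selection setting reported here (keep the top 10% of terms ranked by the chi-square statistic) can be illustrated with a short sketch. The documents, labels, and the function names `chi_square` and `select_top_terms` are hypothetical, and the score is computed for a single target category against all others, which is a simplification of multi-class subject assignment.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 term/category contingency table:
    a = in-category docs with the term, b = other docs with the term,
    c = in-category docs without it,   d = other docs without it."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_top_terms(docs, labels, target, fraction=0.10):
    """Rank vocabulary terms by chi-square against `target` and keep a fraction."""
    vocab = sorted({term for doc in docs for term in doc})
    scores = {}
    for term in vocab:
        a = sum(1 for doc, lab in zip(docs, labels) if term in doc and lab == target)
        b = sum(1 for doc, lab in zip(docs, labels) if term in doc and lab != target)
        c = sum(1 for doc, lab in zip(docs, labels) if term not in doc and lab == target)
        d = sum(1 for doc, lab in zip(docs, labels) if term not in doc and lab != target)
        scores[term] = chi_square(a, b, c, d)
    keep = max(1, round(len(vocab) * fraction))
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))  # ties broken by name
    return ranked[:keep], scores


# Hypothetical mini-corpus: each document is a set of index terms.
docs = [
    {"forest", "tree", "library"},
    {"forest", "bagging", "model"},
    {"forest", "tree", "data"},
    {"catalog", "library", "data"},
    {"catalog", "archive", "model"},
    {"archive", "index", "user"},
]
labels = ["ML", "ML", "ML", "LIS", "LIS", "LIS"]

selected, scores = select_top_terms(docs, labels, target="ML", fraction=0.10)
print(selected)
```

Note that chi-square is symmetric: a term that is strongly absent from the target category scores as high as one that is strongly present, so both kinds of term can survive the cut.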

An Assessment of a Random Forest Classifier for a Crop Classification Using Airborne Hyperspectral Imagery

  • Jeon, Woohyun;Kim, Yongil
    • 대한원격탐사학회지 (Korean Journal of Remote Sensing) / Vol. 34, No. 1 / pp.141-150 / 2018
  • Crop type classification is essential for supporting agricultural decisions and resource monitoring. Remote sensing techniques, especially those using hyperspectral imagery, have been effective in agricultural applications. Hyperspectral imagery acquires contiguous, narrow spectral bands over a wide range. However, the large dimensionality results in unreliable classifier estimates and a high computational burden. Therefore, reducing the dimensionality of hyperspectral imagery is necessary. In this study, the Random Forest (RF) classifier was used for dimensionality reduction as well as for classification. RF is an ensemble-learning algorithm based on the Classification and Regression Tree (CART), which has gained attention due to its high classification accuracy and fast processing speed. The RF performance for crop classification with airborne hyperspectral imagery was assessed. The study area was the cultivated area in Chogye-myeon, Habcheon-gun, Gyeongsangnam-do, South Korea, where the main crops are garlic, onion, and wheat. Parameter optimization was conducted to maximize the classification accuracy. Then, dimensionality reduction was conducted based on RF variable importance. The results show that using the selected bands yields excellent classification accuracy without using the whole dataset. Moreover, a majority of the selected bands are concentrated in the visible (VIS) region, especially the region related to chlorophyll content. Therefore, it can be inferred that the phenological status after the mature stage influences red-edge spectral reflectance.

영화 관객 수 예측을 위한 기계학습 기법의 성능 평가 연구 (A Study on the Performance Evaluation of Machine Learning for Predicting the Number of Movie Audiences)

  • 정찬미;민대기
    • 한국전자거래학회지 (The Journal of Society for e-Business Studies) / Vol. 25, No. 2 / pp.49-63 / 2020
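  • Because film production requires enormous investment while audience demand is highly uncertain, improved demand forecasting can serve as an important decision-making tool for improving revenue. This study verified, in terms of predictive performance, the validity of applying machine learning techniques to forecast demand after a film's release. The findings are summarized as follows. First, statistical tests on candidate variables confirmed that, along with basic film characteristics (director, actors), time-series data up to the second week after release, such as the number of screens, number of screenings, audience numbers, and the level of interest in the main actors, are significant for demand forecasting. Second, when predictive performance was evaluated using classification-based machine learning techniques such as Random Forest Classifier and SVM (Support Vector Machine) and regression-based techniques such as Random Forest Regressor and k-NN Regressor, the Random Forest techniques showed superior results. Third, for films whose cumulative audience was below the first quantile, the regression-based techniques showed low prediction accuracy, whereas the classification-based techniques achieved the best results. In other words, different machine learning techniques need to be applied depending on the distributional characteristics of film demand.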

The Predictive QSAR Model for hERG Inhibitors Using Bayesian and Random Forest Classification Method

  • Kim, Jun-Hyoung;Chae, Chong-Hak;Kang, Shin-Myung;Lee, Joo-Yon;Lee, Gil-Nam;Hwang, Soon-Hee;Kang, Nam-Sook
    • Bulletin of the Korean Chemical Society / Vol. 32, No. 4 / pp.1237-1240 / 2011
  • In this study, we have developed a ligand-based in-silico prediction model to classify chemical structures into hERG blockers using Bayesian and random forest modeling methods. These models were built based on patch clamp experimental results. The findings presented in this work indicate that Laplacian-modified naive Bayesian classification with diverse selection is useful for predicting hERG inhibitors when a large data set is not obtained.

A Study on Diabetes Management System Based on Logistic Regression and Random Forest

  • ByungJoo Kim
    • International journal of advanced smart convergence / Vol. 13, No. 2 / pp.61-68 / 2024
  • In the quest for advancing diabetes diagnosis, this study introduces a novel two-step machine learning approach that synergizes the probabilistic predictions of Logistic Regression with the classification prowess of Random Forest. Diabetes, a pervasive chronic disease impacting millions globally, necessitates precise and early detection to mitigate long-term complications. Traditional diagnostic methods, while effective, often entail invasive testing and may not fully leverage the patterns hidden in patient data. Addressing this gap, our research harnesses the predictive capability of Logistic Regression to estimate the likelihood of diabetes presence, followed by employing Random Forest to classify individuals into diabetic, pre-diabetic or nondiabetic categories based on the computed probabilities. This methodology not only capitalizes on the strengths of both algorithms (Logistic Regression's proficiency in estimating nuanced probabilities and Random Forest's robustness in classification) but also introduces a refined mechanism to enhance diagnostic accuracy. Through the application of this model to a comprehensive diabetes dataset, we demonstrate a marked improvement in diagnostic precision, as evidenced by superior performance metrics when compared to other machine learning approaches. Our findings underscore the potential of integrating diverse machine learning models to improve clinical decision-making processes, offering a promising avenue for the early and accurate diagnosis of diabetes and potentially other complex diseases.

Feature Selection Algorithm for Intrusions Detection System using Sequential Forward Search and Random Forest Classifier

  • Lee, Jinlee;Park, Dooho;Lee, Changhoon
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 11, No. 10 / pp.5132-5148 / 2017
  • Cyber attacks are evolving commensurate with recent developments in information security technology. Intrusion detection systems collect various types of data from computers and networks to detect security threats and analyze the attack information. The large amount of data examined leads to a large number of computations and low detection rates. Feature selection is expected to improve the classification performance and provide faster and more cost-effective results. Despite the various feature selection studies conducted for intrusion detection systems, it is difficult to automate feature selection because it is based on the knowledge of security experts. This paper proposes a feature selection technique to overcome the performance problems of intrusion detection systems. Focusing on feature selection, the first phase of the proposed system constructs a feature subset using a sequential forward floating search (SFFS) to reduce the dimension of the variables. The second phase constructs a classification model with the selected feature subset using a random forest classifier (RFC) and evaluates the classification accuracy. Experiments were conducted with the NSL-KDD dataset using SFFS-RF, and the results indicated that feature selection techniques are a necessary preprocessing step to improve the overall system performance in systems that handle large datasets. They also verified that SFFS-RF could be used for data classification. In conclusion, SFFS-RF could be the key to improving the classification model performance in machine learning.
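The first phase described in this abstract, sequential forward floating search (SFFS), can be sketched as a wrapper around any subset-scoring function. In the paper the score would come from the random forest classifier's accuracy; here `toy_score` is a hypothetical lookup table standing in for that, and the function name `sffs` and its return format are assumptions of this sketch.

```python
def sffs(features, score, k):
    """Sequential forward floating search.
    Returns {subset size: (best score seen, subset)} for sizes 1..k."""
    selected = []
    best = {}
    while len(selected) < k:
        # Forward step: add the single feature that helps most.
        candidates = [f for f in features if f not in selected]
        f_add = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [f_add]
        s = score(selected)
        size = len(selected)
        if size not in best or s > best[size][0]:
            best[size] = (s, list(selected))
        else:
            selected = list(best[size][1])  # fall back to the best known subset
        # Floating step: drop features while that beats the best smaller subset.
        while len(selected) > 2:
            f_drop = max(selected,
                         key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != f_drop]
            s_red = score(reduced)
            if s_red > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (s_red, list(reduced))
            else:
                break
    return best


# Hypothetical accuracy table; any subset not listed scores 0.0.
TOY = {
    frozenset("a"): 0.70, frozenset("b"): 0.60,
    frozenset("c"): 0.55, frozenset("d"): 0.30,
    frozenset("ab"): 0.72, frozenset("ac"): 0.71, frozenset("bc"): 0.90,
    frozenset("abc"): 0.91, frozenset("bcd"): 0.88,
}


def toy_score(subset):
    return TOY.get(frozenset(subset), 0.0)


best = sffs(list("abcd"), toy_score, k=3)
for size in sorted(best):
    print(size, best[size])
```

In this toy table, plain forward selection would settle on {a, b} for size 2 because `a` is the best single feature, whereas the floating step revisits that choice after `c` is added and recovers the better pair {b, c}.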

A Predictive Model to identify possible affected Bipolar disorder students using Naive Baye's, Random Forest and SVM machine learning techniques of data mining and Building a Sequential Deep Learning Model using Keras

  • Peerbasha, S.;Surputheen, M. Mohamed
    • International Journal of Computer Science & Network Security / Vol. 21, No. 5 / pp.267-274 / 2021
  • Medical care practices include gathering a wide range of data on students with manic episodes and depression, which helps specialists diagnose the students' health condition correctly. In this way, the instructors of those students can also identify them and take good care of them. The data collected from the students may be straightforward symptoms that they themselves observe. Naive Bayes classification, the Random Forest classification algorithm, and the SVM algorithm are used to classify the collected datasets and check whether a student is affected by bipolar disorder. The performance of the algorithms on the disease data is calculated and compared. In addition, a sequential deep learning model is built using Keras. The simulation results show the efficacy of the classification techniques on the dataset, as well as the nature and complexity of the dataset used.