• Title/Summary/Keyword: recursive feature elimination

Diagnosis of Alzheimer's Disease using Combined Feature Selection Method

  • Faisal, Fazal Ur Rehman;Khatri, Uttam;Kwon, Goo-Rak
    • Journal of Korea Multimedia Society / v.24 no.5 / pp.667-675 / 2021
  • Treatments for the symptoms of Alzheimer's disease (AD) are available, and research on early diagnosis is ongoing. Several classification techniques based on T1-weighted images have been proposed to distinguish among AD, mild cognitive impairment (MCI), and healthy control (HC) subjects. In this paper, we use traditional machine learning (ML) approaches to diagnose AD. We present an improved feature selection method that reduces model complexity, which is a known issue when applying ML approaches. Combined subcortical and cortical features from 308 subjects in the ADNI dataset, extracted from structural magnetic resonance imaging (sMRI), are used to diagnose AD. Three binary classification experiments were performed: AD vs eMCI, AD vs lMCI, and AD vs HC. The proposed feature selection method combines Principal Component Analysis and Recursive Feature Elimination to reduce dimensionality and select the best features simultaneously. Experiments on the dataset showed that SVM is best suited for the AD vs lMCI, AD vs HC, and AD vs eMCI classifications, with accuracies of 95.83%, 97.83%, and 97.87%, respectively.
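
The abstract does not say how PCA and RFE are combined, so the following is only one possible reading, sketched with scikit-learn: PCA components and RFE-selected original features are placed side by side and fed to a linear SVM. The data are synthetic (not ADNI), and the component and feature counts are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a cortical/subcortical feature table (not ADNI data).
X, y = make_classification(
    n_samples=308, n_features=40, n_informative=8, random_state=0
)

# Assumed combination: PCA components and RFE-selected features, concatenated.
combined = FeatureUnion([
    ("pca", PCA(n_components=5)),
    ("rfe", RFE(SVC(kernel="linear"), n_features_to_select=10)),
])

model = Pipeline([
    ("scale", StandardScaler()),
    ("select", combined),
    ("svm", SVC(kernel="linear")),
])

# Selection happens inside each fold, so held-out folds never influence it.
scores = cross_val_score(model, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")

# Fit once on all data to inspect the width of the combined feature block.
model.fit(X, y)
n_combined = model[:-1].transform(X).shape[1]  # 5 PCA components + 10 RFE features
print("combined feature count:", n_combined)
```

Keeping the selection step inside the pipeline matters here: selecting features on the full dataset before cross-validation would leak information into the accuracy estimate.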

Landslide susceptibility assessment using feature selection-based machine learning models

  • Liu, Lei-Lei;Yang, Can;Wang, Xiao-Mi
    • Geomechanics and Engineering / v.25 no.1 / pp.1-16 / 2021
  • Machine learning models have been widely used for landslide susceptibility assessment (LSA) in recent years. However, the large number of inputs, or conditioning factors, for these models can reduce computational efficiency and make data collection more difficult. Feature selection addresses this problem by selecting the most important features among all factors, reducing the number of input variables. Two important questions remain: (1) how do feature selection methods affect the performance of machine learning models? and (2) which feature selection method is most suitable for a given machine learning model? This paper addresses these two questions by comparing the predictive performance of 13 feature selection-based machine learning (FS-ML) models and 5 ordinary machine learning models on LSA. First, five commonly used machine learning models (logistic regression, support vector machine, artificial neural network, Gaussian process, and random forest) and six typical feature selection methods from the literature are adopted to constitute the proposed models. Then, fifteen conditioning factors are chosen as input variables, and 1,017 landslides are used as recorded data. Next, the feature selection methods are used to rank the importance of the conditioning factors and create feature subsets, from which 13 FS-ML models are constructed. For each machine learning model, the best FS-ML model is selected according to the area under the curve (AUC) value. Finally, five optimal FS-ML models are obtained and applied to the LSA of the study area. The predictive abilities of the FS-ML models on LSA are verified and compared using the receiver operating characteristic curve and statistical indicators such as sensitivity, specificity, and accuracy. The results show that different feature selection methods affect the performance of LSA machine learning models differently. FS-ML models generally outperform the ordinary machine learning models. The best FS-ML model is the random forest (RF) optimized by recursive feature elimination (RFE), and RFE is an optimal method for feature selection.

Intelligent System for the Prediction of Heart Diseases Using Machine Learning Algorithms with Anew Mixed Feature Creation (MFC) technique

  • Rawia Elarabi;Abdelrahman Elsharif Karrar;Murtada El-mukashfi El-taher
    • International Journal of Computer Science & Network Security / v.23 no.5 / pp.148-162 / 2023
  • Classification systems can significantly assist the medical sector by enabling precise and quick diagnosis of diseases, saving time for both doctors and patients. Machine learning algorithms offer one way to identify risk variables. As non-surgical technologies, they are trustworthy and effective in distinguishing healthy patients from those with heart disease, and they save time and effort. The goal of this study is to create a machine learning-based medical intelligent decision support system for the diagnosis of heart disease. We use a mixed feature creation (MFC) technique to generate new features from the UCI Cleveland Cardiology dataset. We select the most suitable features using the Least Absolute Shrinkage and Selection Operator (LASSO), Recursive Feature Elimination with Random Forest feature selection (RFE-RF), and the best features of both LASSO and RFE-RF (BLR). Cross-validation and grid search are used to optimize the parameters of the estimator used in applying these algorithms. Classifier performance metrics, including classification accuracy, specificity, sensitivity, precision, and F1-score, are reported for each classification model along with execution time and RMSE, and the results are presented independently for comparison. Our proposed work finds the best potential outcome across all available prediction models and improves the system's performance, allowing physicians to diagnose heart patients more accurately.

Spatial Prediction of Soil Carbon Using Terrain Analysis in a Steep Mountainous Area and the Associated Uncertainties (지형분석을 이용한 산지토양 탄소의 분포 예측과 불확실성)

  • Jeong, Gwanyong
    • Journal of The Geomorphological Association of Korea / v.23 no.3 / pp.67-78 / 2016
  • Soil carbon (C) is an essential property for characterizing soil quality, but understanding of its spatial patterns is particularly limited for mountain areas. This study aims to predict the spatial pattern of soil C using terrain analysis in a steep mountainous area. Specifically, model performance and prediction uncertainty were investigated as a function of the number of resampling repetitions. Important predictors for soil C were also identified, and the spatial distribution of uncertainty was analyzed. A total of 91 soil samples were collected via conditioned Latin hypercube sampling, and a digital soil C map was developed using support vector regression, a powerful machine learning method. Results showed no distinct differences in model performance depending on the number of repetitions, except for 10-fold cross-validation. For soil C, elevation and surface curvature were selected as important predictors by recursive feature elimination. Soil C showed higher values at higher elevations and on concave slopes. The spatial pattern of soil C might reflect lateral movement of water and materials along the surface configuration of the study area. The higher uncertainty at higher elevations and on concave slopes might be related to the geomorphological characteristics of the research area and the sampling design. This study is expected to provide a better understanding of the relationship between geomorphology and soil C in mountainous ecosystems.

Use of a Machine Learning Algorithm to Predict Individuals with Suicide Ideation in the General Population

  • Ryu, Seunghyong;Lee, Hyeongrae;Lee, Dong-Kyun;Park, Kyeongwoo
    • Psychiatry investigation / v.15 no.11 / pp.1030-1036 / 2018
  • Objective: In this study, we aimed to develop a model predicting individuals with suicide ideation within a general population using a machine learning algorithm. Methods: Among 35,116 individuals aged over 19 years from the Korea National Health & Nutrition Examination Survey, we selected 11,628 individuals via random down-sampling. This included 5,814 suicide ideators and the same number of non-suicide ideators. We randomly assigned the subjects to a training set (n=10,466) and a test set (n=1,162). In the training set, a random forest model was trained with 15 features selected with recursive feature elimination via 10-fold cross validation. Subsequently, the fitted model was used to predict suicide ideators in the test set and among the total of 35,116 subjects. All analyses were conducted in R. Results: The prediction model achieved a good performance [area under receiver operating characteristic curve (AUC)=0.85] in the test set and predicted suicide ideators among the total samples with an accuracy of 0.821, sensitivity of 0.836, and specificity of 0.807. Conclusion: This study shows the possibility that a machine learning approach can enable screening for suicide risk in the general population. Further work is warranted to increase the accuracy of prediction.
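
The workflow in the Methods (recursive feature elimination with a random forest under 10-fold cross-validation, then evaluation on a held-out test set) can be sketched as follows. The study itself was conducted in R; this Python version with scikit-learn's `RFECV` is a substitute for illustration only, run on synthetic data with assumed sizes and hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic, balanced stand-in for the down-sampled survey data.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

# RFECV drops one feature per step and scores each subset size by 10-fold CV.
selector = RFECV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    step=1,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="roc_auc",
)
selector.fit(X_train, y_train)

# After fitting, the selector holds a forest refit on the chosen features,
# so it can score the untouched test set directly.
auc = roc_auc_score(y_test, selector.predict_proba(X_test)[:, 1])
print("features kept:", selector.n_features_)
print(f"test AUC: {auc:.3f}")
```

Unlike the study, which reports a fixed set of 15 features, `RFECV` picks the subset size with the best cross-validated score; `RFE` with `n_features_to_select` would fix the count instead.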

Classification method for failure modes of RC columns based on key characteristic parameters

  • Yu, Bo;Yu, Zecheng;Li, Qiming;Li, Bing
    • Structural Engineering and Mechanics / v.84 no.1 / pp.1-16 / 2022
  • An efficient and accurate classification method for failure modes of reinforced concrete (RC) columns was proposed based on key characteristic parameters. The weight coefficients of seven characteristic parameters for failure modes of RC columns were determined first based on the support vector machine-recursive feature elimination. Then key characteristic parameters for classifying flexure, flexure-shear and shear failure modes of RC columns were selected respectively. Subsequently, a support vector machine with key characteristic parameters (SVM-K) was proposed to classify three types of failure modes of RC columns. The optimal parameters of SVM-K were determined by using the ten-fold cross-validation and the grid-search algorithm based on 270 sets of available experimental data. Results indicate that the proposed SVM-K has high overall accuracy, recall and precision (e.g., accuracy>95%, recall>90%, precision>90%), which means that the proposed SVM-K has superior performance for classification of failure modes of RC columns. Based on the selected key characteristic parameters for different types of failure modes of RC columns, the accuracy of SVM-K is improved and the decision function of SVM-K is simplified by reducing the dimensions and number of support vectors.
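
To make the SVM-RFE step concrete, here is a minimal sketch of the elimination loop written out by hand: fit a linear SVM, score each remaining feature by its squared weight, drop the weakest, and repeat. The squared-weight criterion is the usual SVM-RFE choice and is assumed here, since the abstract does not state it. The seven features and three classes mirror the abstract only in number; the data are synthetic and `svm_rfe_ranking` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def svm_rfe_ranking(X, y):
    """Return feature indices ordered from most to least important."""
    remaining = list(range(X.shape[1]))
    eliminated = []  # filled from least important upward
    while remaining:
        svm = SVC(kernel="linear").fit(X[:, remaining], y)
        # Squared weights, summed over the pairwise classifiers for multiclass.
        weights = (svm.coef_ ** 2).sum(axis=0)
        worst = int(np.argmin(weights))
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]


# Synthetic stand-in: 270 samples, 7 parameters, 3 classes (not the RC column data).
X, y = make_classification(
    n_samples=270, n_features=7, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
X = StandardScaler().fit_transform(X)  # weights are only comparable on scaled inputs

ranking = svm_rfe_ranking(X, y)
print("features, most to least important:", ranking)
```

Standardizing first is a deliberate choice: without it, a feature's weight reflects its units as much as its relevance, and the ranking becomes arbitrary.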

Runoff Prediction from Machine Learning Models Coupled with Empirical Mode Decomposition: A Case Study of the Grand River Basin in Canada

  • Parisouj, Peiman;Jun, Changhyun;Nezhad, Somayeh Moghimi;Narimani, Roya
    • Proceedings of the Korea Water Resources Association Conference / 2022.05a / pp.136-136 / 2022
  • This study investigates the possibility of coupling empirical mode decomposition (EMD) with machine learning (ML) models for runoff prediction. Support vector regression (SVR) and a convolutional neural network (CNN) were considered as the ML algorithms. Precipitation (P), minimum temperature (Tmin), maximum temperature (Tmax), and their intrinsic mode function (IMF) values were used as input variables at a monthly scale from Jan. 1973 to Dec. 2020 in the Grand River basin, Canada. The support vector machine-recursive feature elimination (SVM-RFE) technique was applied to find the best combination of predictors among the input variables. The results show that the proposed method outperformed the individual SVR and CNN models during both the training and testing periods in the study area. According to the correlation coefficient (R), the EMD-SVR model outperformed the EMD-CNN model in both training and testing, even though the CNN performed better than the SVR before the IMF values were used. The EMD-SVR model showed a greater improvement in R value (38.7%) than the EMD-CNN model (7.1%). The coupled EMD-SVR and EMD-CNN models achieved much higher accuracy in runoff prediction with respect to the considered evaluation indicators, including root mean square error (RMSE) and R values.

Development of machine learning framework to inverse-track a contaminant source of hazardous chemicals in rivers (하천에 유입된 유해화학물질의 역추적을 위한 기계학습 프레임워크 개발)

  • Kwon, Siyoon;Seo, Il Won
    • Proceedings of the Korea Water Resources Association Conference / 2020.06a / pp.112-112 / 2020
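  • When hazardous chemicals are accidentally released into a river, a rapid initial response is needed to minimize damage to the aquatic environment. Therefore, to support a response system for chemical accidents in aquatic environments, this study developed a framework that inverse-tracks the release location and release amount of the source using hazardous chemical concentration data observed at real-time river monitoring stations. The proposed framework consists of three steps: first, generating chemical accident scenarios under hydraulic conditions with various flow rates using the Transient Storage Zone Model (TSM) and the HEC-RAS model; second, extracting 21 curve features (BTC features) from the time-concentration curves (BreakThrough Curve; BTC) for the release location and release amount of each generated scenario; and finally, using Recursive Feature Elimination (RFE) to learn the key features for each of the decision tree, random forest, Xgboost, linear support vector machine, kernel support vector machine, and Ridge models, comparing their performance, and presenting the optimal model and feature combination for predicting the release location and the release mass, respectively. In addition, to improve field applicability, the machine learning models were trained under two assumed cases of the time-concentration curve (Whole BTC and Fractured BTC), and the simulation results were compared. To validate the proposed framework, models were built for Gamcheon, a tributary of the Nakdong River, and validated against both scenario data and tracer test data obtained with Rhodamine WT. The comparison showed that the key features differed between models depending on whether feature importance was computed from weights or from impurity reduction, and that the optimal model also differed between the whole BTC (WBTC) and the fractured BTC (FBTC). Most models achieved strong results, with release-location accuracy and R2 for release-mass prediction above 90%.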

Self-optimizing feature selection algorithm for enhancing campaign effectiveness (캠페인 효과 제고를 위한 자기 최적화 변수 선택 알고리즘)

  • Seo, Jeoung-soo;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems / v.26 no.4 / pp.173-198 / 2020
  • Predicting the success of customer campaigns has long been studied in academia, and prediction models applying various techniques are still being investigated. Recently, as campaign channels have diversified with the rapid growth of online activity, companies run campaigns of many types at a scale that cannot be compared to the past. However, as fatigue from duplicate exposure increases, customers tend to perceive campaigns as spam. From a corporate standpoint, campaign effectiveness itself is also declining: investment costs rise while the actual campaign success rate stays low. Accordingly, various studies are ongoing to improve campaign effectiveness in practice. The ultimate purpose of a campaign system is to increase the success rate of campaigns by collecting and analyzing customer-related data and using them for campaigns. In particular, there have been recent attempts to predict campaign responses using machine learning. Because campaign data have many and varied features, selecting appropriate features is very important. If all input data are used when classifying a large amount of data, learning time grows as the number of classes expands, so a minimal input data set must be extracted from the full data. In addition, when a model is trained with too many features, prediction accuracy may degrade due to overfitting or correlation between features. Therefore, to improve accuracy, a feature selection technique that removes features close to noise should be applied; feature selection is a necessary step in analyzing a high-dimensional data set. Among the greedy algorithms, SFS (Sequential Forward Selection), SBS (Sequential Backward Selection), and SFFS (Sequential Floating Forward Selection) are widely used as traditional feature selection techniques. It is also true that, when there are many risks and many features, these methods suffer from poor classification performance and long learning times. Therefore, in this study, we propose an improved feature selection algorithm to enhance the effectiveness of existing campaigns. The aim is to improve the existing SFFS sequential method for searching feature subsets, which underlies improvements in machine learning model performance, by using statistical characteristics of the data processed in the campaign system. Features with a strong influence on performance are derived first, features with a negative effect are removed, and then the sequential method is applied, which increases search efficiency and allows the improved algorithm to produce generalized predictions. The proposed model showed better search and prediction performance than the traditional greedy algorithm. Campaign success prediction was also higher than with the original data set, the greedy algorithm, a genetic algorithm (GA), and recursive feature elimination (RFE). In addition, the improved feature selection algorithm helped in analyzing and interpreting prediction results by providing the importance of the derived features. These included features such as age, customer rating, and sales, which were already known to be statistically important. Features that campaign planners had rarely used to select campaign targets, such as the combined product name, the average 3-month data consumption rate, and the last 3-month wireless data usage, were unexpectedly selected as important for campaign response. This confirmed that base attributes can also be very important features depending on the type of campaign, making it possible to analyze and understand the important characteristics of each campaign type.

Prediction of Customer Satisfaction Using RFE-SHAP Feature Selection Method (RFE-SHAP을 활용한 온라인 리뷰를 통한 고객 만족도 예측)

  • Olga Chernyaeva;Taeho Hong
    • Journal of Intelligence and Information Systems / v.29 no.4 / pp.325-345 / 2023
  • In the rapidly evolving domain of e-commerce, our study presents a cohesive approach to enhance customer satisfaction prediction from online reviews, aligning methodological innovation with practical insights. We integrate the RFE-SHAP feature selection with LDA topic modeling to streamline predictive analytics in e-commerce. This integration facilitates the identification of key features, specifically narrowing down from an initial set of 28 to an optimal subset of 14 features for the Random Forest algorithm. Our approach strategically mitigates the common issue of overfitting in models with an excess of features, leading to an improved accuracy rate of 84% in our Random Forest model. Central to our analysis is the understanding that certain aspects in review content, such as quality, fit, and durability, play a pivotal role in influencing customer satisfaction, especially in the clothing sector. We delve into explaining how each of these selected features impacts customer satisfaction, providing a comprehensive view of the elements most appreciated by customers. Our research makes significant contributions in two key areas. First, it enhances predictive modeling within the realm of e-commerce analytics by introducing a streamlined, feature-centric approach. This refinement in methodology not only bolsters the accuracy of customer satisfaction predictions but also sets a new standard for handling feature selection in predictive models. Second, the study provides actionable insights for e-commerce platforms, especially those in the clothing sector. By highlighting which aspects of customer reviews (such as quality, fit, and durability) most influence satisfaction, we offer a strategic direction for businesses to tailor their products and services.