• Title/Summary/Keyword: Feature Importance Analysis


Comparison of Chlorophyll-a Prediction and Analysis of Influential Factors in Yeongsan River Using Machine Learning and Deep Learning (머신러닝과 딥러닝을 이용한 영산강의 Chlorophyll-a 예측 성능 비교 및 변화 요인 분석)

  • Sun-Hee, Shim;Yu-Heun, Kim;Hye Won, Lee;Min, Kim;Jung Hyun, Choi
    • Journal of Korean Society on Water Environment / v.38 no.6 / pp.292-305 / 2022
  • The Yeongsan River, one of the four largest rivers in South Korea, has faced water quality management difficulties caused by algal blooms. The threat has grown, especially since two weirs were constructed on the mainstream of the river. Prediction and factor analysis of Chlorophyll-a (Chl-a) concentration are therefore needed for effective water quality management. In this study, Chl-a prediction models were developed and their performance evaluated using machine learning and deep learning methods: Deep Neural Network (DNN), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost). The correlation analysis and feature importance results were also compared to identify the major factors affecting Chl-a concentration. All models showed high prediction performance, with an $R^2$ of 0.9 or higher. In particular, XGBoost showed the highest prediction accuracy, 0.95, on the test data. The feature importance results suggested that ammonia (NH3-N) and phosphate (PO4-P) were major factors common to all three models for managing Chl-a concentration. These results confirm that the three methods, DNN, RF, and XGBoost, are powerful tools for predicting water quality parameters. Comparing feature importance with correlation analysis would also give a more accurate assessment of the major factors.
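
The closing point of this abstract, that correlation analysis and feature importance can disagree, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the data are synthetic, `predict` is a hypothetical stand-in for a trained model, and permutation importance is used as a model-agnostic substitute for the importance measures built into RF or XGBoost, which the abstract does not specify.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, X, y, rng):
    """Drop in R^2 when one feature column is shuffled, for each feature."""
    base = r2_score(y, [predict(row) for row in X])
    drops = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        rng.shuffle(col)
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        drops.append(base - r2_score(y, [predict(row) for row in X_perm]))
    return drops

# Synthetic data (hypothetical): a full grid over three features.
levels = [-2.0, -1.0, 0.0, 1.0, 2.0]
X = [[a, b, c] for a in levels for b in levels for c in levels]

def predict(row):
    # Stand-in "model": linear in feature 0, quadratic in feature 1,
    # and independent of feature 2.
    return 3.0 * row[0] + row[1] ** 2

y = [predict(row) for row in X]

correlations = [pearson([row[j] for row in X], y) for j in range(3)]
importances = permutation_importance(predict, X, y, random.Random(0))

for j in range(3):
    print(f"feature {j}: correlation={correlations[j]:.3f}, "
          f"importance={importances[j]:.3f}")
```

On this grid the second feature has essentially zero linear correlation with the target but a clearly positive permutation importance, because its effect is quadratic. Correlation alone would miss it, which is one reason to compare the two analyses.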

Stacked Autoencoder Based Malware Feature Refinement Technology Research (Stacked Autoencoder 기반 악성코드 Feature 정제 기술 연구)

  • Kim, Hong-bi;Lee, Tae-jin
    • Journal of the Korea Institute of Information Security & Cryptology / v.30 no.4 / pp.593-603 / 2020
  • The volume of malicious code has increased exponentially as malware generation tools have spread with the development of networks, and existing detection methods are limited in their ability to respond. Machine-learning-based malware detection has evolved in response. This paper extracts features from the PE header for machine-learning-based malware detection and uses an autoencoder to study how to extract features that represent the malware, along with their feature importance. Specifically, 549 features, composed of information such as DLL/API usage that can be identified from the PE files commonly used in malware analysis, are extracted, and an autoencoder is applied to these features to improve detection performance. By storing the data in compressed form, the approach extracted the data's features effectively, provided excellent accuracy, and reduced processing time by a factor of two. The test results were shown to be useful for classifying malware groups; in future work, a classifier such as SVM will be introduced to pursue more accurate malware detection.

Real-Time Locomotion Mode Recognition Employing Correlation Feature Analysis Using EMG Pattern

  • Kim, Deok-Hwan;Cho, Chi-Young;Ryu, Jaehwan
    • ETRI Journal / v.36 no.1 / pp.99-105 / 2014
  • This paper presents a new locomotion mode recognition method based on a transformed correlation feature analysis using an electromyography (EMG) pattern. Each movement is recognized using six weighted subcorrelation filters, which are applied to the correlation feature analysis through the use of six time-domain features. The proposed method has a high recognition rate because it reflects the importance of the different features according to the movements, and it can recognize EMG patterns in real time owing to the rapid execution of the correlation feature analysis. The experimental results show that the discriminating power of the proposed method is 85.89% ($\pm 2.5$) when walking on a level surface, 96.47% ($\pm 0.9$) when going up stairs, and 96.37% ($\pm 1.3$) when going down stairs for given normal movement data. Its accuracy and stability are thus better than those of the principal component analysis and linear discriminant analysis methods.

Arrow Diagrams for Kernel Principal Component Analysis

  • Huh, Myung-Hoe
    • Communications for Statistical Applications and Methods / v.20 no.3 / pp.175-184 / 2013
  • Kernel principal component analysis (PCA) maps observations in a nonlinear feature space onto a reduced-dimensional plane of principal components. We do not need to specify the feature space explicitly because the procedure uses the kernel trick. In this paper, we propose a graphical scheme to represent variables in kernel principal component analysis. In addition, we propose an index that measures the importance of individual variables in the principal component plane.
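
The kernel trick mentioned in this abstract can be written out in standard kernel PCA notation. This is the textbook formulation, not the paper's own notation; the symbols $k$, $K$, $\mathbf{1}_n$, $\alpha^{(m)}$, $\lambda_m$, and $z_{im}$ are assumptions introduced for illustration, with $\mathbf{1}_n$ denoting the $n \times n$ matrix whose entries are all $1/n$.

```latex
K_{ij} = k(x_i, x_j), \qquad
\tilde{K} = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n, \qquad
\tilde{K}\, \alpha^{(m)} = n \lambda_m\, \alpha^{(m)}, \qquad
z_{im} = \sum_{j=1}^{n} \alpha^{(m)}_j \tilde{K}_{ij}
```

Only kernel evaluations $k(x_i, x_j)$ appear, so the feature map itself never has to be computed. The scores $z_{i1}$ and $z_{i2}$ give the coordinates of observation $i$ in the plane of the first two principal components.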

Feature selection-based Risk Prediction for Hypertension in Korean men (한국 남성의 고혈압에 대한 특징 선택 기반 위험 예측)

  • Dashdondov, Khongorzul;Kim, Mi-Hye
    • Proceedings of the Korea Information Processing Society Conference / 2021.05a / pp.323-325 / 2021
  • In this article, we improve hypertension prediction by applying a feature selection method to Korean national health data from the KNHANES database. The study identified a variety of risk factors associated with chronic hypertension. The paper is divided into two modules. The first is a data pre-processing step that applies a factor analysis (FA) based feature selection method to the dataset. The second applies predictive analysis to predict hypertension risk. We compare the mean standard error (MSE), F1-score, and area under the ROC curve (AUC) of each classification model. The test results show that the proposed FIFA-OE-NB algorithm achieves an MSE, F1-score, and AUC of 0.259, 0.460, and 64.70%, respectively. These results demonstrate that the proposed FIFA-OE method outperforms the other models for hypertension risk prediction.

Application of Random Forest Algorithm for the Decision Support System of Medical Diagnosis with the Selection of Significant Clinical Test (의료진단 및 중요 검사 항목 결정 지원 시스템을 위한 랜덤 포레스트 알고리즘 적용)

  • Yun, Tae-Gyun;Yi, Gwan-Su
    • The Transactions of The Korean Institute of Electrical Engineers / v.57 no.6 / pp.1058-1062 / 2008
  • In a clinical decision support system (CDSS), an appropriate data-driven machine learning method, unlike a rule-based expert method, can readily provide information about individual features (clinical tests) for disease classification. However, currently developed methods focus on improving classification accuracy for diagnosis. By analyzing feature importance in classification, one may infer novel sets of clinical tests that strongly differentiate specific diseases or disease states. Against this background, we introduce a novel CDSS that integrates a classifier and a feature selection module. The random forest algorithm serves as both the classifier and the feature importance measure. The system selects the significant clinical tests that discriminate the diseases by examining the classification error during backward elimination of features. The random forest algorithm was compared against artificial neural network and decision tree algorithms on the breast cancer, diabetes, and heart disease data sets in the UCI Machine Learning Repository and showed superior performance in clinical classification. Tests with the same data sets show that the proposed system can successfully select the significant clinical test set for each disease.
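
The selection step described in this abstract, backward elimination guided by classification error, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' system: a nearest-centroid classifier is swapped in for the random forest to keep the example dependency-free, the error is measured on the training data, and the tiny dataset and all function names are hypothetical.

```python
def nearest_centroid_error(X, y, features):
    """Training error rate of a nearest-centroid classifier on `features`."""
    classes = sorted(set(y))
    centroids = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [sum(r[f] for r in rows) / len(rows) for f in features]
    errors = 0
    for x, label in zip(X, y):
        pred = min(
            classes,
            key=lambda c: sum(
                (x[f] - m) ** 2 for f, m in zip(features, centroids[c])
            ),
        )
        errors += pred != label
    return errors / len(y)

def backward_elimination(X, y, error_fn, tolerance=0.0):
    """Drop features one at a time while the error does not get worse."""
    features = list(range(len(X[0])))
    best_error = error_fn(X, y, features)
    while len(features) > 1:
        # Evaluate the error after removing each remaining feature in turn.
        candidates = [
            (error_fn(X, y, [f for f in features if f != drop]), drop)
            for drop in features
        ]
        err, drop = min(candidates)
        if err > best_error + tolerance:
            break  # every possible removal hurts, so stop
        features.remove(drop)
        best_error = min(best_error, err)
    return features, best_error

# Hypothetical data: feature 0 separates the classes, features 1 and 2 are noise.
X = [
    [0.0, 5.0, 1.0], [2.0, 1.0, 9.0], [1.0, 8.0, 4.0],
    [10.0, 2.0, 8.0], [9.0, 7.0, 2.0], [11.0, 4.0, 5.0],
]
y = [0, 0, 0, 1, 1, 1]

selected, final_error = backward_elimination(X, y, nearest_centroid_error)
print(selected, final_error)
```

In practice the error would be estimated on held-out or out-of-bag data rather than on the training set, and the classifier would be whichever model the system actually uses.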

A gradient boosting regression based approach for energy consumption prediction in buildings

  • Bataineh, Ali S. Al
    • Advances in Energy Research / v.6 no.2 / pp.91-101 / 2019
  • This paper proposes an efficient data-driven approach to building models for predicting energy consumption in buildings. The data used in this research were collected by installing humidity and temperature sensors at different locations in a building. In addition, weather data from a nearby weather station are included in the dataset to study the impact of weather conditions on energy consumption. One of the main emphases of this research is to make feature selection independent of domain knowledge. Therefore, to extract useful features from the data, two approaches are tested: feature selection through principal component analysis, and relative importance-based feature selection in the original domain. The regression model used is gradient boosting regression, and its optimal parameters are chosen through a two-stage coarse-fine search. Model performance is evaluated with metrics such as the r2-score and root mean squared error. The results show that the best performance is achieved when relative importance-based feature selection is used with the gradient boosting regressor. The proposed technique also outperformed the support vector machine and neural network-based approaches tested on the same dataset.

An Improved method of Two Stage Linear Discriminant Analysis

  • Chen, Yarui;Tao, Xin;Xiong, Congcong;Yang, Jucheng
    • KSII Transactions on Internet and Information Systems (TIIS) / v.12 no.3 / pp.1243-1263 / 2018
  • Two-stage linear discriminant analysis (TSLDA) is a feature extraction technique for solving the small sample size problem in image recognition. TSLDA retains all subspace information of the between-class scatter and the within-class scatter. However, the feature information in the four subspaces may not be entirely beneficial for classification, and the regularization procedure for eliminating singular matrices in TSLDA has high time complexity. To address these drawbacks, this paper proposes an improved two-stage linear discriminant analysis (Improved TSLDA). Improved TSLDA uses a selection and compression method to extract superior feature information from the four subspaces and form an optimal projection space, defining a single Fisher criterion to measure the importance of a single feature vector. Improved TSLDA also applies an approximation matrix method to eliminate the singular matrices and reduce time complexity. Comparative experiments on five face databases and one handwritten digit database validate the effectiveness of Improved TSLDA.
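
For orientation, the classical Fisher criterion for a single projection vector $w$ is sketched below. This is the textbook form, given here as an assumption; the abstract does not state the exact criterion used in Improved TSLDA, and the symbols $S_b$, $S_w$, $\mu_c$, $\mu$, $n_c$, and $\mathcal{X}_c$ are introduced only for this illustration.

```latex
J(w) = \frac{w^{\top} S_b\, w}{w^{\top} S_w\, w},
\qquad
S_b = \sum_{c=1}^{C} n_c\, (\mu_c - \mu)(\mu_c - \mu)^{\top},
\qquad
S_w = \sum_{c=1}^{C} \sum_{x_i \in \mathcal{X}_c} (x_i - \mu_c)(x_i - \mu_c)^{\top}
```

A larger $J(w)$ means that projecting onto $w$ separates the class means more strongly relative to the within-class spread. When there are fewer samples than dimensions, $S_w$ is singular and the denominator can vanish, which is the small sample size problem the abstract refers to.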

Analysis of the Feature Importance of Occupational Accidents Occurring at Construction Sites on the Severity of Lost Workdays (건설 현장에서 발생한 업무상 재해가 근로손실일수 심각도에 미치는 특징 중요도 분석)

  • Kang, Kyung-Su;Choi, Jae-Hyun;Ryu, Han-Guk
    • Journal of the Korea Institute of Building Construction / v.21 no.2 / pp.165-174 / 2021
  • The construction industry has the most accidents and fatalities of any industry. Although many efforts have been made to reduce safety accidents in construction, studies of the workdays lost before workers return to the workplace are insufficient. Therefore, this study proposes a model that classifies lost workdays as moderate or severe, derives variable importance, and analyzes the important factors through a trained random forest model. We analyzed the learning process of the random forest, which is a black box model, and used the extracted feature importance to identify the variables that affect the severity of lost workdays. The underlying factors were then analyzed through the extracted variables. The purpose of this study is to analyze accident case data from construction sites with a random forest model and to review the variables that have a high impact on lost workdays. In the future, this study can be applied to improve construction safety management and reduce industrial accidents.

Phonological Error Patterns: Clinical Aspects on Coronal Feature (음운 오류 패턴: 설정성 자질의 임상적 고찰)

  • Kim, Min-Jung;Lee, Sung-Eun
    • Phonetics and Speech Sciences / v.2 no.4 / pp.239-244 / 2010
  • The purpose of this study is to investigate two phonological error patterns in the coronal feature among children with functional articulation disorders and to compare them with those of general children. We tested 120 children with functional articulation disorders and 100 general children aged 2 to 4 years with the 'Assessment of Phonology & Articulation for Children (APAC)'. The results were as follows: (1) 37 disordered children substituted [+coronal] consonants for [-coronal] consonants (fronting of velars) and 9 disordered children substituted [-coronal] consonants for [+coronal] consonants (backing to velars). (2) These two phonological patterns were affected by the articulatory place of the following phoneme. (3) The fronting pattern of children with articulation disorders was similar to that of general children, but their backing pattern was different from that of general children. These results show the clinical usefulness of the coronal feature in phonological pattern analysis, the need for articulatory assessment in various phonetic contexts, and the importance of error contexts in clinical judgment.
