• Title/Abstract/Keywords: Pearson Feature Selection

11 search results (processing time 0.023 s)

A Feature Set Selection Approach Based on Pearson Correlation Coefficient for Real Time Attack Detection

  • 강승호;정인선;임형석
    • 융합보안논문지 / Vol. 18, No. 5_1 / pp. 59-66 / 2018
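  • The performance of an intrusion detection system that uses machine learning depends heavily on the composition and size of its feature set. Detection accuracy, such as the detection rate, depends on the composition of the feature set, while training and detection time depend on its size. Therefore, to enable real-time detection in an intrusion detection system, where an immediate response is essential, the feature set must be small yet composed of appropriate features. This paper shows that using the Pearson correlation coefficient between features together with the multi-objective genetic algorithm previously used to solve the feature set selection problem for real-time detection reduces the size of the feature set while barely lowering the detection rate. To evaluate the proposed method, an artificial neural network is designed and implemented to distinguish 10 attack types from normal traffic using the NSL_KDD data, and experiments are conducted with it.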


Performance Improvement of Freight Logistics Hub Selection in Thailand by Coordinated Simulation and AHP

  • Wanitwattanakosol, Jirapat;Holimchayachotikul, Pongsak;Nimsrikul, Phatchari;Sopadang, Apichat
    • Industrial Engineering and Management Systems / Vol. 9, No. 2 / pp. 88-96 / 2010
  • This paper presents a two-phase quantitative framework to aid the decision-making process for selecting an efficient freight logistics hub from 8 alternatives in Thailand on the North-South economic corridor. Phase 1 employs both multiple regression and Pearson Feature selection to find the important criteria, as defined by the logistics hub score, and to reduce the number of criteria by eliminating the less important ones. The result of Pearson Feature selection indicated that only 5 of 15 criteria affected the logistics hub score. Moreover, a Genetic Algorithm (GA) was constructed from the original 15-criteria data set to find the relationship between the logistics criteria and the freight logistics hub score. As a result, the statistical tools yielded the same 5 important criteria affecting the logistics hub score as the GA and the data mining tool. Phase 2 performs the fuzzy stochastic AHP analysis with the five important criteria. This approach could help to gain insight into how the imprecision in judgment ratios may affect the alternatives and how the best alternative may be identified with a certain confidence. The main objective of the paper is to find the best alternative for a freight logistics hub under proper criteria. The experimental results show that, using this approach, Chiang Mai province is the best location at the 95% confidence level.
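A correlation filter of the kind this abstract describes could be sketched as follows. The abstract does not state the exact selection rule, so the threshold of 0.5 on |r|, the function names, and the toy criteria are all assumptions made for illustration only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)


def select_by_target_correlation(columns, target, threshold=0.5):
    """Keep the criteria whose |r| with the target is at least `threshold`.

    The threshold value is an assumption, not taken from the paper.
    """
    scores = {name: pearson(values, target) for name, values in columns.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores


# Hypothetical toy data: three criteria and a hub score.
criteria = {
    "cost": [1.0, 2.0, 3.0, 4.0, 5.0],
    "distance": [5.0, 4.0, 3.0, 2.0, 1.0],
    "noise": [1.0, -1.0, 0.0, -1.0, 1.0],
}
hub_score = [2.0, 4.0, 6.0, 8.0, 10.0]

kept, scores = select_by_target_correlation(criteria, hub_score)
print(kept)
```

Here "cost" and "distance" are perfectly (positively and negatively) correlated with the score and are kept, while "noise" is uncorrelated and is dropped. Using the absolute value treats strong negative correlation as equally informative.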

Feature Selection for Classification of Mass Spectrometric Proteomic Data Using Random Forest

  • 온승엽;지승도;한미영
    • 한국시뮬레이션학회논문지 / Vol. 22, No. 4 / pp. 139-147 / 2013
  • This paper proposes a new feature selection method for the classification analysis of mass spectrometric proteomic data. The method consists of (i) a preprocessing step that effectively removes redundant, highly correlated features and (ii) a step that searches for the optimal feature subset using a tournament strategy. The proposed method was applied to publicly available blood proteomic data used in actual cancer diagnosis, and it was shown to achieve superior performance and balanced specificity and sensitivity compared with other widely used methods.

Compositional Feature Selection and Its Effects on Bandgap Prediction by Machine Learning

  • 남충희
    • 한국재료학회지 / Vol. 33, No. 4 / pp. 164-174 / 2023
  • The bandgap characteristics of semiconductor materials are an important factor when utilizing them for various applications. In this study, based on data provided by AFLOW (Automatic-FLOW for Materials Discovery), the bandgap of a semiconductor material was predicted using only the material's compositional features. The compositional features were generated using the Python modules 'Pymatgen' and 'Matminer'. Pearson's correlation coefficients (PCC) between the compositional features were calculated, and those with a correlation coefficient larger than 0.95 were removed in order to avoid overfitting. The bandgap prediction performance was compared using the $R^2$ score and the root-mean-squared error. Predicting the bandgap with random forest and XGBoost as representative ensemble algorithms, it was found that XGBoost gave better results after cross-validation and hyper-parameter tuning. To investigate the effect of compositional feature selection on the bandgap prediction of the machine learning model, the prediction performance was studied as a function of the number of features, based on feature importance methods. It was found that there were no significant changes in prediction performance beyond an appropriate number of features. Furthermore, artificial neural networks were employed to compare the prediction performance by adjusting the number of features guided by the PCC values, resulting in a best $R^2$ score of 0.811. By comparing and analyzing the bandgap distribution and prediction performance according to the material group containing specific elements (F, N, Yb, Eu, Zn, B, Si, Ge, Fe, Al), various information for material design was obtained.
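The PCC-based pruning step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a greedy rule (keep a feature only if it is not too correlated with any feature already kept) and compares the absolute value of the coefficient against 0.95. The abstract gives only the 0.95 cut-off, so the greedy order, the use of |PCC|, and the feature names are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant feature: treat as uncorrelated
    return cov / (sx * sy)


def drop_highly_correlated(columns, threshold=0.95):
    """Greedy filter: keep a feature only if its |PCC| with every
    already-kept feature is at most `threshold`."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical compositional features (names and values are made up).
features = {
    "mean_atomic_mass": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mean_atomic_number": [1.1, 2.0, 3.1, 3.9, 5.0],  # near-duplicate of the first
    "electronegativity_range": [0.5, 2.5, 0.1, 1.9, 1.0],
}

kept = drop_highly_correlated(features)
print(kept)
```

With this rule the result depends on column order: of two near-duplicate features, the one that appears first survives.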

A Study on the Prediction of BMI (Benthic Macroinvertebrate Index) Using Machine Learning Based CFS (Correlation-based Feature Selection) and Random Forest Model

  • 고우석;윤춘경;이한필;황순진;이상우
    • 한국물환경학회지 / Vol. 35, No. 5 / pp. 425-431 / 2019
  • Recently, good-quality water resources and water welfare have been attracting attention as ways to improve quality of life. This study predicts the benthic macroinvertebrate index (BMI), an indicator of aquatic ecological health, using the machine learning based CFS (Correlation-based Feature Selection) method and the random forest model, and compares the measured and predicted values of the BMI. From data collected over 10 years from a branch of the Han River, 1312 records were extracted and used. Pearson correlation analysis of these data showed a lack of correlation between any single factor and the BMI, so the CFS method for multiple regression analysis was introduced. The study identified 10 factors (water temperature, DO, electrical conductivity, turbidity, BOD, $NH_3-N$, T-N, $PO_4-P$, T-P, average flow rate) that are considered to be related to the BMI. The random forest model was built on these ten factors. To assess the validity of the model, $R^2$, %Difference, NSE (Nash-Sutcliffe Efficiency) and RMSE (Root Mean Square Error) were used. The reported values were 0.9438, -0.997, and 0.992, and the accuracy rate was at the 71.6% level. These results can suggest future directions for water resource management and a pre-review function for aquatic ecological prediction.
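Two of the validation metrics named in this abstract, NSE and RMSE, have standard textbook definitions and can be computed as in the sketch below. The observed and predicted values are made-up numbers, not data from the study, and %Difference is left out because the abstract does not define it.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean square error between observations and predictions."""
    n = len(observed)
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def nse(observed, predicted):
    """Nash-Sutcliffe Efficiency: 1 minus the ratio of the residual sum of
    squares to the sum of squares of the observations around their mean.
    1.0 is a perfect fit; 0.0 is no better than predicting the mean."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance


# Hypothetical measured and predicted index values.
observed = [60.0, 70.0, 80.0, 90.0]
predicted = [62.0, 68.0, 83.0, 89.0]

model_rmse = rmse(observed, predicted)
model_nse = nse(observed, predicted)
print(round(model_rmse, 4), round(model_nse, 4))
```

NSE is undefined when all observations are equal (the denominator is zero), so a real implementation would need to guard against that case.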

The ensemble approach in comparison with the diverse feature selection techniques for estimating NPPs parameters using the different learning algorithms of the feed-forward neural network

  • Moshkbar-Bakhshayesh, Khalil
    • Nuclear Engineering and Technology / Vol. 53, No. 12 / pp. 3944-3951 / 2021
  • Several reasons, such as the no free lunch theorem, indicate that there is no universal feature selection (FS) technique that outperforms all others. Moreover, some approaches, such as using a synthetic dataset, are very tedious and time-consuming in the presence of a large number of FS techniques. In this study, to tackle the dependency of estimation accuracy on the selected FS technique, a methodology based on the heterogeneous ensemble is proposed. The performance of the major learning algorithms of the neural network (i.e. the FFNN-BR, the FFNN-LM) in combination with diverse FS techniques (i.e. the NCA, the F-test, the Kendall's tau, the Pearson, the Spearman, and the Relief) and different combination techniques of the heterogeneous ensemble (i.e. the Min, the Median, the Arithmetic mean, and the Geometric mean) is considered. The target parameters/transients of the Bushehr nuclear power plant (BNPP) are examined as the case study. The results show that the Min combination technique gives the more accurate estimation. Therefore, if the number of FS techniques is m and the number of learning algorithms is n, the heterogeneous ensemble may reduce the search space for acceptable estimation of the target parameters from n × m to n × 1. The proposed methodology gives a simple and practical approach for more reliable and more accurate estimation of the target parameters compared to methods such as the use of a synthetic dataset or trial and error.
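The four combination rules listed in this abstract can be sketched with the Python standard library. What exactly the paper combines is not stated in the abstract. The sketch assumes the rules are applied to several estimates of the same target value, one per FS technique, and the numbers are hypothetical.

```python
from statistics import fmean, geometric_mean, median

# Combination rules named in the abstract, mapped to stdlib functions.
COMBINERS = {
    "min": min,
    "median": median,
    "arithmetic_mean": fmean,
    "geometric_mean": geometric_mean,  # requires strictly positive inputs
}


def combine(estimates, rule):
    """Combine per-model estimates of one target value with the given rule."""
    return COMBINERS[rule](estimates)


# Hypothetical estimates of one parameter from three FS-specific models.
estimates = [2.0, 4.0, 8.0]
combined = {rule: combine(estimates, rule) for rule in COMBINERS}
print(combined)
```

Whichever rule is chosen, the ensemble collapses the m FS-specific models of one learning algorithm into a single estimate, which is the sense in which the search space shrinks from n × m to n × 1.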

Classifying Cancer Using Partially Correlated Genes Selected by Forward Selection Method

  • 유시호;조성배
    • 대한전자공학회논문지SP / Vol. 41, No. 3 / pp. 83-92 / 2004
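  • Gene expression data are measurements, taken on a microarray, of samples collected from specific tissues of an organism, in which the expression levels of genes appear as numerical values. Because the expression levels of the relevant genes generally differ between normal and abnormal tissue, cancer can be classified from gene expression data. However, not all genes are involved in the classification, so efficient cancer classification requires feature selection, the task of picking out a small number of relevant genes. This paper proposes a method that selects genes using the forward selection method, one of the variable selection methods of regression analysis, and classifies with them. The method minimizes redundant information among the selected genes and thereby selects genes more effectively for cancer classification. The colon cancer dataset was used as experimental data, and k-nearest neighbors (KNN) was used as the classifier. A comparison with correlation-based feature selection, namely the Pearson and Spearman correlation coefficient methods, confirmed that feature selection by forward selection chooses genes more effectively for cancer classification. The experiments showed a high recognition rate of 90.3%. Additional experiments on lymphoma data confirmed the usefulness of the forward selection method.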
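One common way to realize forward selection with partial correlations is sketched below: at each step, pick the feature most correlated with what is left of the target after the already-selected features have been regressed out, then remove the chosen feature's contribution from the target and from every remaining feature. This is a generic sketch, not the paper's exact procedure. The stopping tolerance, function names, and toy gene data are assumptions.

```python
from math import sqrt


def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def corr(u, v):
    """Correlation of two already-centered vectors (0 if either is ~zero)."""
    nu, nv = sqrt(dot(u, u)), sqrt(dot(v, v))
    if nu < 1e-12 or nv < 1e-12:
        return 0.0
    return dot(u, v) / (nu * nv)


def remove_component(v, basis):
    """Residual of v after a one-variable regression on `basis`."""
    coef = dot(v, basis) / dot(basis, basis)
    return [a - coef * b for a, b in zip(v, basis)]


def forward_select(columns, target, k):
    """Greedily select up to k features by partial correlation with the target."""
    residual_y = center(target)
    residuals = {name: center(vals) for name, vals in columns.items()}
    selected = []
    while len(selected) < k and residuals:
        best = max(residuals, key=lambda n: abs(corr(residuals[n], residual_y)))
        if abs(corr(residuals[best], residual_y)) < 1e-9:
            break  # nothing left explains the remaining target variation
        basis = residuals.pop(best)
        selected.append(best)
        residual_y = remove_component(residual_y, basis)
        residuals = {n: remove_component(v, basis) for n, v in residuals.items()}
    return selected


# Hypothetical expression levels; gene_b is a noisy near-copy of gene_a.
genes = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "gene_b": [1.2, 1.9, 3.1, 3.8, 5.3, 5.9],
    "gene_c": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
}
label = [4.0, -1.0, 6.0, 1.0, 8.0, 3.0]  # equals gene_a + 3 * gene_c

selected = forward_select(genes, label, k=3)
print(selected)
```

In this toy case the procedure stops after two genes even though three were requested, because the redundant gene_b has nothing left to explain once gene_c and gene_a are in the model. A plain ranking by Pearson or Spearman correlation with the label has no such mechanism for discounting redundancy.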

Analyzing Factors Contributing to Research Performance using Backpropagation Neural Network and Support Vector Machine

  • Ermatita, Ermatita;Sanmorino, Ahmad;Samsuryadi, Samsuryadi;Rini, Dian Palupi
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 16, No. 1 / pp. 153-172 / 2022
  • In this study, the authors analyze factors contributing to research performance using a Backpropagation Neural Network and a Support Vector Machine. The analysis of factors contributing to lecturer research performance starts with defining the features. The next stage is to collect datasets based on the defined features, and then to transform the raw dataset into data ready to be processed. After the data is transformed, the next stage is the selection of features. Before the selection of features, the target feature is determined, namely research performance. The selection of features consists of Chi-Square selection (U) and the Pearson correlation coefficient (CM). The selection of features yields eight factors contributing to lecturer research performance: Scientific Papers (U: 154.38, CM: 0.79), Number of Citations (U: 95.86, CM: 0.70), Conference (U: 68.67, CM: 0.57), Grade (U: 10.13, CM: 0.29), Grant (U: 35.40, CM: 0.36), IPR (U: 19.81, CM: 0.27), Qualification (U: 2.57, CM: 0.26), and Grant Awardee (U: 2.66, CM: 0.26). To analyze the factors, two data mining classifiers were used, Backpropagation Neural Networks (BPNN) and Support Vector Machine (SVM). Evaluation of the classifiers gave an accuracy score of 95 percent for BPNN and 92 percent for SVM. The essence of this analysis is not to find the highest accuracy score, but rather to check whether the factors can pass the test phase with the expected results. The findings of this study reveal the factors that have a significant impact on research performance and vice versa.

Direct Divergence Approximation between Probability Distributions and Its Applications in Machine Learning

  • Sugiyama, Masashi;Liu, Song;du Plessis, Marthinus Christoffel;Yamanaka, Masao;Yamada, Makoto;Suzuki, Taiji;Kanamori, Takafumi
    • Journal of Computing Science and Engineering / Vol. 7, No. 2 / pp. 99-111 / 2013
  • Approximating a divergence between two probability distributions from their samples is a fundamental challenge in statistics, information theory, and machine learning. A divergence approximator can be used for various purposes, such as two-sample homogeneity testing, change-point detection, and class-balance estimation. Furthermore, an approximator of a divergence between the joint distribution and the product of marginals can be used for independence testing, which has a wide range of applications, including feature selection and extraction, clustering, object matching, independent component analysis, and causal direction estimation. In this paper, we review recent advances in divergence approximation. Our emphasis is that directly approximating the divergence without estimating probability distributions is more sensible than a naive two-step approach of first estimating probability distributions and then approximating the divergence. Furthermore, despite the overwhelming popularity of the Kullback-Leibler divergence as a divergence measure, we argue that alternatives such as the Pearson divergence, the relative Pearson divergence, and the $L^2$-distance are more useful in practice because of their computationally efficient approximability, high numerical stability, and superior robustness against outliers.
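For readers unfamiliar with the divergences named in this abstract, their usual definitions are given below. The notation ($p$ and $q$ for the two densities, $\alpha$ for the mixing weight of the relative Pearson divergence) is an assumption here and may differ from the paper's.

```latex
\begin{align}
\mathrm{KL}(p \,\|\, q) &= \int p(x) \log \frac{p(x)}{q(x)} \, dx \\
\mathrm{PE}(p \,\|\, q) &= \frac{1}{2} \int q(x) \left( \frac{p(x)}{q(x)} - 1 \right)^{2} dx \\
\mathrm{rPE}_{\alpha}(p \,\|\, q) &= \mathrm{PE}\bigl(p \,\|\, \alpha p + (1 - \alpha) q\bigr), \qquad 0 \le \alpha < 1 \\
L^{2}(p, q) &= \int \bigl( p(x) - q(x) \bigr)^{2} dx
\end{align}
```

The Pearson divergence replaces the logarithm of the density ratio with a squared deviation of the ratio from 1. For $\alpha > 0$ the relative variant compares $p$ with a mixture that contains $p$ itself, so the ratio inside the integral is bounded above by $1/\alpha$.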

Assessment of Landslide Susceptibility in Jecheon Using Deep Learning Based on Exploratory Data Analysis

  • 안상아;이정현;박혁진
    • 지질공학 / Vol. 33, No. 4 / pp. 673-687 / 2023
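  • Exploratory data analysis is the process of observing and understanding collected data from various angles, identifying the distribution of and correlations in the data by analyzing its structure and characteristics. Landslides are generally triggered by various factors, and the influence of those factors differs by area, so identifying the correlations among causative factors and selecting the characteristic ones through exploratory data analysis before a landslide susceptibility analysis allows a more effective analysis. Therefore, to examine how exploratory data analysis affects the performance of a prediction model, this study selected factors through two stages of exploratory data analysis and carried out deep learning-based landslide susceptibility analyses using the combinations of selected causative factors and the combination of all 23 causative factors. The exploratory data analysis used a heat map of Pearson correlation coefficients and a histogram of random forest factor importance. The accuracy of the deep learning-based analysis was evaluated by verifying, with a confusion matrix, the landslide susceptibility map produced from the landslide susceptibility index values obtained in the analysis. The susceptibility analysis using all 23 factors showed a low accuracy of 55.90%, whereas the analysis using the 13 factors selected through one stage of exploration showed an accuracy of 81.25%, and the analysis using the 9 causative factors selected through both stages showed the highest accuracy, 92.80%. This confirms that selecting characteristic causative factors through exploratory data analysis and using them in the analysis can be expected to yield better performance in landslide susceptibility analysis.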