• Title/Abstract/Keywords: correlation based feature selection

55 search results (processing time: 0.023 s)

Analyzing empirical performance of correlation based feature selection with company credit rank score dataset - Emphasis on KOSPI manufacturing companies -

  • Nam, Youn Chang;Lee, Kun Chang
    • Journal of the Korea Society of Computer and Information / Vol. 21, No. 4 / pp. 63-71 / 2016
  • This paper applies an efficient data mining method to improve score calculation and the construction of a credit ranking score system. The main idea is to achieve these objectives with Correlation based Feature Selection, which can also be used to quickly verify the appropriateness of existing rank scores. The study selected 2047 manufacturing companies on the KOSPI market for the period 2009 to 2013, each with a credit rank score assigned by the NICE information service agency. For the relevant financial variables, a total of 80 variables were collected from KIS-Value and DART (Data Analysis, Retrieval and Transfer System). If correlation based feature selection can identify the more important variables, the required information and cost can be reduced significantly. The analysis shows that the proposed correlation based feature selection method improves the selection and classification process of the credit rank system, increasing accuracy and credibility while decreasing the cost of building the system.
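
The abstract above does not spell out how Correlation based Feature Selection scores a subset. As a rough sketch, assuming the commonly cited merit heuristic (mean feature-target correlation rewarded, mean feature-feature correlation penalized) with a greedy forward search, it could look like the following. The feature names and toy data are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cfs_merit(features, target, subset):
    """Merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff) for the named subset."""
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean absolute feature-target correlation.
    r_cf = sum(abs(pearson(features[f], target)) for f in subset) / k
    # Mean absolute feature-feature correlation over all pairs in the subset.
    pairs = list(combinations(subset, 2))
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def cfs_forward(features, target):
    """Greedy forward search: keep adding the feature that raises the merit most."""
    selected, best = [], 0.0
    remaining = set(features)
    while remaining:
        cand = max(sorted(remaining),
                   key=lambda f: cfs_merit(features, target, selected + [f]))
        score = cfs_merit(features, target, selected + [cand])
        if score <= best:  # stop when no candidate improves the merit
            break
        selected.append(cand)
        remaining.remove(cand)
        best = score
    return selected, best


# Hypothetical toy data: f1 and f2 are uncorrelated, f1_dup is a rescaled copy of f1.
features = {
    "f1": [1.0, 1.0, -1.0, -1.0],
    "f1_dup": [2.0, 2.0, -2.0, -2.0],
    "f2": [1.0, -1.0, 1.0, -1.0],
}
target = [2.0, 0.0, 0.0, -2.0]  # equals f1 + f2

selected, best = cfs_forward(features, target)
print(selected, round(best, 4))
```

On this toy input the redundant copy is left out, because adding it raises the feature-feature correlation without adding information about the target.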

A Study on the Prediction of BMI (Benthic Macroinvertebrate Index) Using Machine Learning Based CFS (Correlation-based Feature Selection) and Random Forest Model

  • 고우석;윤춘경;이한필;황순진;이상우
    • Journal of Korean Society on Water Environment / Vol. 35, No. 5 / pp. 425-431 / 2019
  • Recently, public attention has turned to the quality of water resources and to water welfare as a means of improving quality of life. This study predicts the benthic macroinvertebrate index (BMI), an indicator of aquatic ecological health, using the machine learning based CFS (Correlation-based Feature Selection) method and the random forest model, and compares the measured and predicted BMI values. From data collected over 10 years in the Han River's branches, 1312 records were extracted and used. Pearson correlation analysis of these data showed a lack of correlation between any single factor and BMI, so the CFS method for multiple regression analysis was introduced. It identified 10 factors (water temperature, DO, electrical conductivity, turbidity, BOD, $NH_3-N$, T-N, $PO_4-P$, T-P, average flow rate) considered to be related to BMI, and the random forest model was built on these ten factors. To validate the model, $R^2$, %Difference, NSE (Nash-Sutcliffe Efficiency), and RMSE (Root Mean Square Error) were used. The reported values were 0.9438, -0.997, and 0.992, and the accuracy rate was at the 71.6% level. These results can suggest a future direction for water resource management and a pre-review function for water ecological prediction.
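
For readers unfamiliar with the validation metrics named in the abstract above, here is a minimal sketch of RMSE, NSE, and $R^2$ using their standard textbook definitions. It assumes $R^2$ means the squared Pearson correlation between observed and predicted values. %Difference is left out because the abstract does not define it. The sample numbers are made up for illustration and are not the study's data.

```python
import math


def rmse(observed, predicted):
    """Root Mean Square Error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def nse(observed, predicted):
    """Nash-Sutcliffe Efficiency: 1 minus residual variance over observed variance."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / spread


def r_squared(observed, predicted):
    """Squared Pearson correlation (assumed meaning of R^2 here)."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


# Made-up index values, purely for illustration.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

print(rmse(observed, predicted))       # 0.5
print(nse(observed, predicted))        # approximately 0.95
print(r_squared(observed, predicted))  # approximately 0.953
```

An NSE of 1 means the predictions match the observations exactly, while a value at or below 0 means the model does no better than predicting the observed mean.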

Study on Correlation-based Feature Selection in an Automatic Quality Inspection System using Support Vector Machine (SVM)

  • 송동환;오영광;김남훈
    • Journal of Korean Institute of Industrial Engineers / Vol. 42, No. 6 / pp. 370-376 / 2016
  • Manufacturing data analysis and its applications are gaining great popularity in various industries. Despite rapid advances in big data analysis technology, however, manufacturing quality data monitored by an automated inspection system are sometimes not reliable enough because of the complex patterns of product quality. This study therefore aims to define the level of trust of an automated quality inspection system and to improve the reliability of the quality inspection data. Using correlation analysis and feature selection, the paper presents a method for improving inspection accuracy and efficiency in an SVM-based automatic product quality inspection system that uses thermal image data from an auto part manufacturing case. The proposed method is implemented in the sealer dispensing process of automobile manufacturing and verified by analyzing the optimal feature selection from the quality analysis results.

Gait-Based Gender Classification Using a Correlation-Based Feature Selection Technique

  • Beom Kwon
    • Journal of the Korea Society of Computer and Information / Vol. 29, No. 3 / pp. 55-66 / 2024
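  • Gender classification technology has attracted considerable attention from researchers because it can be used in fields such as forensics, surveillance systems, and demographic studies. Since earlier studies showed that male and female gait have distinguishing characteristics, various techniques have been proposed for classifying gender from 3D gait data. However, some of the gait features that existing techniques extract from 3D gait data are similar to or redundant with one another, or do not help gender classification. This study therefore proposes a method that uses a correlation-based feature selection technique to select the features that help gender classification. To demonstrate the usefulness of the proposed technique, the performance of gender classification models before and after applying it was compared on a publicly available 3D gait dataset. Eight machine learning algorithms applicable to binary classification problems were used in the experiments. The results show that the proposed feature selection technique reduces the number of features by 22, from 82 to 60, while maintaining gender classification performance.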

Hybrid Feature Selection Method Based on Genetic Algorithm for the Diagnosis of Coronary Heart Disease

  • Wiharto, Wiharto;Suryani, Esti;Setyawan, Sigit;Putra, Bintang PE
    • Journal of information and communication convergence engineering / Vol. 20, No. 1 / pp. 31-40 / 2022
  • Coronary heart disease (CHD) is a comorbidity of COVID-19; therefore, routine early diagnosis is crucial. A large number of examination attributes in the context of diagnosing CHD is a distinct obstacle during the pandemic when the number of health service users is significant. The development of a precise machine learning model for diagnosis with a minimum number of examination attributes can allow examinations and healthcare actions to be undertaken quickly. This study proposes a CHD diagnosis model based on feature selection, data balancing, and ensemble-based classification methods. In the feature selection stage, a hybrid SVM-GA combined with fast correlation-based filter (FCBF) is used. The proposed system achieved an accuracy of 94.60% and area under the curve (AUC) of 97.5% when tested on the z-Alizadeh Sani dataset and used only 8 of 54 inspection attributes. In terms of performance, the proposed model can be placed in the very good category.
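
The fast correlation-based filter (FCBF) mentioned in the abstract above is usually described as ranking discrete attributes by symmetrical uncertainty, a normalized form of information gain. The sketch below shows only that measure, not the paper's hybrid SVM-GA pipeline. The attribute names and values are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y)), ranging from 0 to 1."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    hxy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    info_gain = hx + hy - hxy       # mutual information
    return 2.0 * info_gain / (hx + hy)


# Hypothetical binary attributes and a binary diagnosis label.
diagnosis = [0, 0, 1, 1]
typical_chest_pain = [0, 0, 1, 1]  # mirrors the label exactly
sex = [0, 1, 0, 1]                 # independent of the label in this toy sample

print(symmetrical_uncertainty(typical_chest_pain, diagnosis))  # 1.0
print(symmetrical_uncertainty(sex, diagnosis))                 # 0.0
```

In an FCBF-style filter, attributes with high symmetrical uncertainty against the class are kept, and an attribute is discarded when it is more strongly tied to an already kept attribute than to the class.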

Feature Selection for Classification of Mass Spectrometric Proteomic Data Using Random Forest

  • 온승엽;지승도;한미영
    • Journal of the Korea Society for Simulation / Vol. 22, No. 4 / pp. 139-147 / 2013
  • This paper proposes a new feature selection method for the classification analysis of mass spectrometric proteomic data. The method consists of i) a preprocessing step that effectively removes redundant, highly correlated features and ii) a step that searches for the optimal feature subset using a tournament strategy. The proposed method was applied to publicly available blood proteomic data used in actual cancer diagnosis and was shown to achieve superior performance, with balanced specificity and sensitivity, compared with other widely used methods.

A Study on Predicting TDI (Trophic Diatom Index) in tributaries of Han river basin using Correlation-based Feature Selection technique and Random Forest algorithm

  • 김민규;윤춘경;이한필;황순진;이상우
    • Journal of Korean Society on Water Environment / Vol. 35, No. 5 / pp. 432-438 / 2019
  • The purpose of this study is to predict the Trophic Diatom Index (TDI) in tributaries of the Han River watershed using the random forest algorithm. One year (2017) of supplied aquatic ecology health data was used. The data include water quality factors (BOD, T-N, $NH_3-N$, T-P, $PO_4-P$, water temperature, DO, pH, conductivity, turbidity), hydraulic factors (water width, average water depth, average velocity of water), and the TDI score. Seven factors (water temperature, BOD, T-N, $NH_3-N$, T-P, $PO_4-P$, and average water depth) were selected by Correlation Feature Selection, and a TDI prediction model was generated by random forest using these seven factors. The model was first evaluated on the 2017 data set. $R^2$, % Difference, NSE (Nash-Sutcliffe Efficiency), RMSE (Root Mean Square Error), and the accuracy rate show that the model is suitable for predicting TDI: $R^2$ is 0.93, % Difference is -0.37, NSE is 0.89, RMSE is 8.22, and the accuracy rate is 70.4%. An additional evaluation was performed using a data set with more than 17 times as many measurement points, and the results were similar to those obtained with the 2017 data set. The Wilcoxon Signed Ranks Test shows no statistically significant difference between actual and predicted data for the 2017 data set. These results can identify the elements that probably affect aquatic ecology health and can provide direction for water quality management in a watershed that must be continuously preserved.

Category Factor Based Feature Selection for Document Classification

  • Kang Yun-Hee
    • International Journal of Contents / Vol. 1, No. 2 / pp. 26-30 / 2005
  • With the fast growth of information on the Internet, it is becoming increasingly difficult to find and organize useful information. To reduce information overload, automatic text classification is needed to handle the enormous number of documents. Support Vector Machine (SVM) is a model that is calculated as a weighted sum of kernel function outputs. This paper describes a document classifier for web documents in the field of Information Technology that uses SVM to learn a model constructed from the training sets and their representative terms. The basic idea is to exploit the distribution of the representative terms' meanings in coherent thematic texts of each category using simple statistical methods. The vector-space model is applied to represent documents in the categories, using a feature selection scheme based on TF-IDF. A category factor, which represents the effect of a term within a category, is applied to the feature selection. Experiments show the categorization results and the correlation of vector length.

Compositional Feature Selection and Its Effects on Bandgap Prediction by Machine Learning

  • 남충희
    • Korean Journal of Materials Research / Vol. 33, No. 4 / pp. 164-174 / 2023
  • The bandgap characteristics of semiconductor materials are an important factor when using these materials in various applications. In this study, based on data provided by AFLOW (Automatic-FLOW for Materials Discovery), the bandgap of a semiconductor material was predicted using only the material's compositional features. The compositional features were generated using the Python modules 'Pymatgen' and 'Matminer'. Pearson's correlation coefficients (PCC) between the compositional features were calculated, and features with a correlation coefficient larger than 0.95 were removed to avoid overfitting. The bandgap prediction performance was compared using the $R^2$ score and root-mean-squared error. When the bandgap was predicted with random forest and XGBoost as representative ensemble algorithms, XGBoost gave better results after cross-validation and hyper-parameter tuning. To investigate the effect of compositional feature selection on bandgap prediction, the prediction performance was studied as a function of the number of features chosen by feature importance methods. There were no significant changes in prediction performance beyond an appropriate number of features. Furthermore, artificial neural networks were employed to compare the prediction performance while adjusting the number of features guided by the PCC values, resulting in a best $R^2$ score of 0.811. By comparing and analyzing the bandgap distribution and prediction performance for material groups containing specific elements (F, N, Yb, Eu, Zn, B, Si, Ge, Fe, Al), various information for material design was obtained.
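
The PCC-based pruning step described in the abstract above can be sketched roughly as follows. This is a minimal standard-library illustration and does not reproduce the authors' Pymatgen/Matminer pipeline. The feature names and values are hypothetical, and keeping the first feature of each highly correlated pair is an assumption, since the abstract does not say which member of a pair is dropped.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_highly_correlated(features, threshold=0.95):
    """Keep a feature only if |PCC| with every already kept feature is <= threshold.

    Assumption: features are visited in insertion order, so the earlier
    member of a highly correlated pair survives.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical compositional features for five materials.
features = {
    "mean_atomic_number": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mean_atomic_mass": [2.1, 4.0, 6.2, 7.9, 10.1],   # nearly proportional to the first
    "electronegativity_range": [3.0, 1.0, 4.0, 1.0, 5.0],
}

kept = drop_highly_correlated(features, threshold=0.95)
print(kept)  # ['mean_atomic_number', 'electronegativity_range']
```

The 0.95 cutoff is the value quoted in the abstract. A lower threshold removes more features and a higher one removes fewer.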

A Feature Selection Method Based on Fuzzy Cluster Analysis

  • 이현숙
    • The KIPS Transactions: Part B / Vol. 14B, No. 2 / pp. 135-140 / 2007
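  • Feature selection is a data preparation process that builds effective experimental data by selecting, from multidimensional data observed in a problem domain, the attributes that best reflect the structure the data describe. As an important component for improving the performance of classification systems in areas such as document classification, image recognition, and gene selection, it has been studied mainly through approaches from statistics and information theory, including correlation techniques, dimensionality reduction, and mutual information. Research in this area is becoming more important as the data being handled grow larger and more complex. This paper proposes a feature selection method that reflects the characteristics of the data while generalizing to new data. For each attribute of the prepared data, optimal cluster information is obtained by fuzzy cluster analysis. Based on this information, the degree of proximity and separation is measured, and features are selected according to the resulting values. The proposed method is applied to real-world computer virus classification and compared with classification on data selected by an existing contrast-based heuristic method. The comparison confirms that the given features can be ranked and that selecting features effectively can improve system performance.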