• Title/Abstract/Keyword: Feature selection

Search Results: 1,064

An Empirical Study on Improving the Performance of Text Categorization Considering the Relationships between Feature Selection Criteria and Weighting Methods (자질 선정 기준과 가중치 할당 방식간의 관계를 고려한 문서 자동분류의 개선에 대한 연구)

  • Lee Jae-Yun
    • Journal of the Korean Society for Library and Information Science / v.39 no.2 / pp.123-146 / 2005
  • This study aims to find consistent strategies for feature selection and feature weighting that can improve the effectiveness and efficiency of a kNN text classifier. Feature selection criteria and feature weighting methods are as important as the classification algorithm for achieving good text categorization performance, yet most earlier studies paired them in conflicting ways. In this study, the performance of several feature selection criteria is measured while accounting for the storage space of inverted index records and the classification time. The classification experiments examine IDF as a feature selection criterion and conventional feature selection criteria, e.g. mutual information, as feature weighting methods. The results suggest that by using measures that prefer low-frequency features both as the feature selection criterion and as the feature weighting method, classification speed can be increased three- to five-fold without losing classification accuracy.
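
A minimal sketch of the strategy the abstract reports, assuming IDF as the low-frequency-preferring measure used both as the selection criterion and as the weight before kNN classification; the toy document-term counts, the smoothing, and all parameter values are illustrative rather than the paper's setup:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def idf(counts):
    """counts: (n_docs, n_terms) raw term-frequency matrix."""
    n_docs = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)          # document frequency per term
    return np.log((n_docs + 1) / (df + 1)) + 1.0   # smoothed IDF

def select_and_weight(counts, k):
    scores = idf(counts)
    keep = np.argsort(scores)[::-1][:k]            # keep the k rarest terms
    return counts[:, keep] * scores[keep], keep    # IDF doubles as the weight

rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(100, 500))              # toy document-term counts
y = rng.integers(0, 2, size=100)                   # toy binary categories
Xw, kept = select_and_weight(X, k=100)
clf = KNeighborsClassifier(n_neighbors=5).fit(Xw, y)
```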

Landslide susceptibility assessment using feature selection-based machine learning models

  • Liu, Lei-Lei;Yang, Can;Wang, Xiao-Mi
    • Geomechanics and Engineering / v.25 no.1 / pp.1-16 / 2021
  • Machine learning models have been widely used for landslide susceptibility assessment (LSA) in recent years. The large number of inputs or conditioning factors for these models, however, can reduce computational efficiency and increase the difficulty of collecting data. Feature selection addresses this problem by selecting the most important features among all factors, thereby reducing the size of the input. However, two important questions need to be answered: (1) how do feature selection methods affect the performance of machine learning models? and (2) which feature selection method is the most suitable for a given machine learning model? This paper addresses these questions by comparing the predictive performance of 13 feature selection-based machine learning (FS-ML) models and 5 ordinary machine learning models on LSA. First, five commonly used machine learning models (i.e., logistic regression, support vector machine, artificial neural network, Gaussian process and random forest) and six typical feature selection methods from the literature are adopted to constitute the proposed models. Then, fifteen conditioning factors are chosen as input variables and 1,017 recorded landslides are used as data. Next, the feature selection methods are used to rank the importance of the conditioning factors and create feature subsets, from which 13 FS-ML models are constructed. For each machine learning model, the best optimized FS-ML model is selected according to the area under curve (AUC) value. Finally, five optimal FS-ML models are obtained and applied to the LSA of the study area. Their predictive abilities are verified and compared through the receiver operating characteristic curve and statistical indicators such as sensitivity, specificity and accuracy. The results show that different feature selection methods have different effects on the performance of LSA machine learning models, and that FS-ML models generally outperform ordinary machine learning models. The best FS-ML model is the recursive feature elimination (RFE) optimized random forest, making RFE an optimal feature selection method in this setting.
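
The winning combination reported above, RFE wrapped around a random forest and scored by AUC, can be sketched with scikit-learn as follows; the synthetic data merely stands in for the paper's 15 conditioning factors and 1,017 landslides:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=1017, n_features=15, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
selector = RFE(rf, n_features_to_select=8).fit(X_tr, y_tr)  # drop weak factors

rf_sel = RandomForestClassifier(n_estimators=200, random_state=0)
rf_sel.fit(selector.transform(X_tr), y_tr)
proba = rf_sel.predict_proba(selector.transform(X_te))[:, 1]
print("AUC with RFE-selected factors:", roc_auc_score(y_te, proba))
```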

Genetic Algorithm Based Feature Selection Method Development for Pattern Recognition (패턴 인식문제를 위한 유전자 알고리즘 기반 특징 선택 방법 개발)

  • Park Chang-Hyun;Kim Ho-Duck;Yang Hyun-Chang;Sim Kwee-Bo
    • Journal of the Korean Institute of Intelligent Systems / v.16 no.4 / pp.466-471 / 2006
  • An important problem in pattern recognition is extracting or selecting the feature set, which is part of the pre-processing stage. Principal component analysis has usually been used for feature extraction, while SFS (Sequential Forward Selection) and SBS (Sequential Backward Selection) have been used for feature selection. This paper applies the genetic algorithm, a popular method for nonlinear optimization problems, to the feature selection problem. We call this approach Genetic Algorithm Feature Selection (GAFS) and compare it with other methods in terms of performance.
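
A compact sketch of the GAFS idea under stated assumptions: chromosomes are binary masks over features and fitness is cross-validated kNN accuracy; the population size, truncation selection, and mutation rate are choices of this sketch, not the paper's:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_feat, pop_size, n_gen = X.shape[1], 20, 30

def fitness(mask):
    if not mask.any():
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

pop = rng.integers(0, 2, size=(pop_size, n_feat)).astype(bool)
for _ in range(n_gen):
    fit = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(fit)[::-1][:pop_size // 2]]   # truncation selection
    cut = rng.integers(1, n_feat, size=pop_size // 2)      # one-point crossover
    children = np.array([np.concatenate((a[:c], b[c:]))
                         for a, b, c in zip(parents, np.roll(parents, 1, 0), cut)])
    children ^= rng.random(children.shape) < 0.02          # bit-flip mutation
    pop = np.vstack((parents, children))

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best))
```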

Reinforcement Learning Method Based Interactive Feature Selection(IFS) Method for Emotion Recognition (감성 인식을 위한 강화학습 기반 상호작용에 의한 특징선택 방법 개발)

  • Park Chang-Hyun;Sim Kwee-Bo
    • Journal of Institute of Control, Robotics and Systems / v.12 no.7 / pp.666-670 / 2006
  • This paper presents a novel feature selection method for emotion recognition, a task that can involve a large number of original features; here, the emotions to be recognized are carried by speech signals. Feature selection benefits pattern recognition performance and mitigates the 'curse of dimensionality'. We implemented a simulator called 'IFS', and its results were applied to an emotion recognition system (ERS) also implemented for this research. The proposed method is based on reinforcement learning and, because it requires responses from a human user, is called 'Interactive Feature Selection'. By running IFS we obtained the 3 best features and applied them to the ERS; these 3 features outperformed a randomly selected feature set.
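
The paper's IFS algorithm is not specified in the abstract, so the following is only a loose bandit-style rendering of the interactive idea: each feature's value estimate is updated from a reward that, in the paper, would come from a human user; here `ask_user` is a hypothetical placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_select, eps, alpha = 20, 3, 0.2, 0.1
value = np.zeros(n_feat)                     # running value estimate per feature

def ask_user(subset):
    """Placeholder for interactive feedback (e.g., perceived ERS quality)."""
    return rng.random()                      # stands in for a human rating

for step in range(200):
    if rng.random() < eps:                   # explore: random subset
        subset = rng.choice(n_feat, size=n_select, replace=False)
    else:                                    # exploit: current best features
        subset = np.argsort(value)[::-1][:n_select]
    reward = ask_user(subset)
    value[subset] += alpha * (reward - value[subset])   # incremental update

print("3 best features:", np.argsort(value)[::-1][:n_select])
```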

Comparing Korean Spam Document Classification Using Document Classification Algorithms (문서 분류 알고리즘을 이용한 한국어 스팸 문서 분류 성능 비교)

  • Song, Chull-Hwan;Yoo, Seong-Joon
    • Proceedings of the Korean Information Science Society Conference / 2006.10c / pp.222-225 / 2006
  • Korea has more internet users than most other countries, and accordingly Korean internet users report considerable inconvenience from spam mail. To address this problem, this paper describes a study on Korean spam document filtering using various feature weighting, feature selection, and document classification algorithms. Nouns are extracted from Korean documents (spam/non-spam) and used as input features for each classification algorithm. For feature weighting, instead of conventional approaches, we compute the variance of each feature and apply a max value selection method to choose global features, and then apply traditional feature selection methods (MI, IG, CHI) to extract the final features. The extracted features are fed to classification algorithms such as Naive Bayes and Support Vector Machine; for the Vector Space Model, the traditional method is used as-is. As a result, the combination of a Support Vector Machine classifier, TF-IDF variance weighting (combined with max value selection), and CHI feature selection achieved recall of 99.4%, precision of 97.4%, and F-measure of 98.39%.
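
A hedged sketch of the best-performing pipeline described above, using scikit-learn: TF-IDF features, CHI (chi-square) selection, and a linear SVM. The variance/max-value weighting step is approximated by plain TF-IDF here, and the toy corpus is English rather than extracted Korean nouns:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = ["free lottery win now", "meeting agenda attached",
        "cheap pills offer expires", "quarterly project status report"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = non-spam

pipe = make_pipeline(
    TfidfVectorizer(),        # the paper extracts nouns from Korean text first
    SelectKBest(chi2, k=5),   # CHI selection (k far larger on a real corpus)
    LinearSVC(),              # SVM classifier
)
pipe.fit(docs, labels)
print(pipe.predict(["win cheap lottery pills"]))
```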


Optimal k-Nearest Neighborhood Classifier Using Genetic Algorithm (유전알고리즘을 이용한 최적 k-최근접이웃 분류기)

  • Park, Chong-Sun;Huh, Kyun
    • Communications for Statistical Applications and Methods / v.17 no.1 / pp.17-27 / 2010
  • Feature selection and feature weighting are useful techniques for improving the classification accuracy of the k-Nearest Neighbor (k-NN) classifier. Their main purpose is to reduce the number of features by eliminating irrelevant and redundant ones while maintaining or enhancing classification accuracy. In this paper, a novel hybrid approach based on a genetic algorithm is proposed for simultaneous feature selection, feature weighting, and choice of k in the k-NN classifier. The results indicate that the proposed algorithm is comparable with or superior to existing classifiers with or without feature selection and feature weighting capability.
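
One way to picture the hybrid chromosome the abstract describes, where a single genome jointly encodes the feature mask, per-feature weights, and k; the decoding scheme and the cross-validated fitness below are assumptions of this sketch, and the evolutionary loop itself would follow the GAFS sketch shown earlier:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
n_feat = X.shape[1]

def decode(genome):
    mask = genome[:n_feat] > 0.5            # first block: selection bits
    weights = genome[n_feat:2 * n_feat]     # second block: feature weights
    k = 1 + int(genome[-1] * 14)            # last gene: k in 1..15
    return mask, weights, k

def fitness(genome):
    mask, weights, k = decode(genome)
    if not mask.any():
        return 0.0
    Xw = X[:, mask] * weights[mask]         # weight the surviving features
    clf = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(clf, Xw, y, cv=3).mean()

rng = np.random.default_rng(0)
genome = rng.random(2 * n_feat + 1)         # a GA would evolve a population of these
print("fitness of one random genome:", fitness(genome))
```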

Feature Selection of Fuzzy Pattern Classifier by using Fuzzy Mapping (퍼지 매핑을 이용한 퍼지 패턴 분류기의 Feature Selection)

  • Roh, Seok-Beom;Kim, Yong Soo;Ahn, Tae-Chon
    • Journal of the Korean Institute of Intelligent Systems / v.24 no.6 / pp.646-650 / 2014
  • In this paper, to avoid the deterioration of pattern classification performance caused by the curse of dimensionality, we propose a new feature selection method. The proposed method is based on the Fuzzy C-Means clustering algorithm, which divides the data points into several clusters, and on the concept of a function with fuzzy numbers. For a function whose independent variables are fuzzy numbers and whose dependent variable is a class label, each fuzzy number should map to only one class label; a good feature is therefore one that behaves as an independent variable of such a function. Under this assumption, we calculate the goodness of each feature for the pattern classification problem. Finally, to evaluate the classification ability of the proposed pattern classifier, machine learning data sets are used.
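
A rough numpy rendering of the scoring idea as read from the abstract: fuzzy-cluster each feature on its own, then reward features whose fuzzy clusters concentrate on a single class label. The tiny 1-D fuzzy c-means and the membership-weighted purity score are assumptions, not the paper's exact formulation:

```python
import numpy as np

def fcm_1d(x, c=3, m=2.0, iters=50):
    """Tiny 1-D fuzzy c-means; returns membership matrix u of shape (c, n)."""
    rng = np.random.default_rng(0)
    u = rng.random((c, len(x)))
    u /= u.sum(axis=0)
    for _ in range(iters):
        centers = (u**m @ x) / (u**m).sum(axis=1)
        d = np.abs(x[None, :] - centers[:, None]) + 1e-9
        u = 1.0 / (d ** (2 / (m - 1)))
        u /= u.sum(axis=0)
    return u

def feature_goodness(x, y, c=3):
    u = fcm_1d(x, c)
    score = 0.0
    for j in range(c):                      # membership-weighted class purity
        w = u[j]
        purity = max(w[y == lab].sum() for lab in np.unique(y)) / w.sum()
        score += purity * w.sum() / len(x)
    return score

# rank the features of a toy dataset by goodness
X = np.random.default_rng(1).random((150, 5))
y = np.repeat([0, 1, 2], 50)
print(sorted(range(5), key=lambda f: -feature_goodness(X[:, f], y)))
```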

Feature Combination and Selection Using Genetic Algorithm for Character Recognition (유전 알고리즘을 이용한 특징 결합과 선택)

  • Lee Jin-Seon
    • The Journal of the Korea Contents Association / v.5 no.5 / pp.152-158 / 2005
  • By combining different feature sets extracted from input character patterns, we can improve character recognition performance. To reduce the dimensionality of the combined feature vector, we perform feature selection. This paper proposes a general framework for feature combination and selection in character recognition problems, and presents a specific design for handwritten numeral recognition. In the design, DDD and AGD feature sets are extracted from handwritten numeral patterns, and a genetic algorithm is used for the feature selection. Experimental results showed an accuracy improvement of about 0.7% on the CENPARMI handwritten numeral database.
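
A minimal sketch of the combine-then-select framework: two feature arrays, standing in for the paper's DDD and AGD descriptors, are concatenated and then pruned by a binary mask that a genetic algorithm (as in the GAFS sketch earlier) would evolve; here the mask is a random stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
ddd = rng.random((200, 64))                  # placeholder for the DDD feature set
agd = rng.random((200, 32))                  # placeholder for the AGD feature set

combined = np.hstack((ddd, agd))             # feature combination: 96-d vectors
mask = rng.random(combined.shape[1]) > 0.5   # stand-in for a GA-evolved mask
selected = combined[:, mask]                 # reduced vectors for the classifier
print(combined.shape, "->", selected.shape)
```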


Comparison of Feature Selection Methods in Support Vector Machines (지지벡터기계의 변수 선택방법 비교)

  • Kim, Kwangsu;Park, Changyi
    • The Korean Journal of Applied Statistics / v.26 no.1 / pp.131-139 / 2013
  • Support vector machines (SVMs) may perform poorly in the presence of noise variables, and it is difficult to identify the importance of each variable in the resulting classifier. Feature selection can improve both the interpretability and the accuracy of an SVM. Most existing studies concern feature selection in the linear SVM through penalty functions that yield sparse solutions. In practice, however, nonlinear kernels are usually adopted for classification accuracy, so feature selection remains desirable for nonlinear SVMs. In this paper, we compare the performance of nonlinear feature selection methods such as component selection and smoothing operator (COSSO) and kernel iterative feature extraction (KNIFE) on simulated and real data sets.
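
Neither COSSO nor KNIFE has a widely packaged implementation, so as a plainly labeled stand-in the sketch below ranks variables for a nonlinear (RBF) SVM by permutation importance, which pursues the same goal of identifying the variables a kernel classifier actually relies on:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf").fit(X_tr, y_tr)       # nonlinear kernel classifier
imp = permutation_importance(svm, X_te, y_te, n_repeats=20, random_state=0)
print("importance per variable:", imp.importances_mean.round(3))
```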

Effective Multi-label Feature Selection based on Large Offspring Set created by Enhanced Evolutionary Search Process

  • Lim, Hyunki;Seo, Wangduk;Lee, Jaesung
    • Journal of the Korea Society of Computer and Information / v.23 no.9 / pp.7-13 / 2018
  • Recent advances in data gathering techniques have improved information collection and thereby enabled learning between gathered data patterns and application sub-tasks. Because a pattern can be associated with multiple labels, multi-label learning is required, and multi-label feature selection has drawn significant attention for its ability to improve multi-label learning accuracy. However, existing evolutionary multi-label feature selection methods suffer from an ineffective search process. In this study, we propose an enhanced evolutionary search process for the multi-label feature selection problem. The proposed method creates a large set of offspring, i.e. new feature subsets, and then retains only the most promising ones. Experimental results demonstrate that the proposed method can identify feature subsets that give good multi-label classification accuracy much faster than conventional methods.
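
A sketch of the large-offspring-set idea under stated assumptions: each generation spawns many mutated offspring from the current feature subset and retains only the most promising one; the offspring count, mutation rate, and the micro-F1 kNN scorer are choices of this sketch:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X, Y = make_multilabel_classification(n_samples=300, n_features=30, n_labels=3,
                                      random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

def score(mask):
    if not mask.any():
        return 0.0
    clf = KNeighborsClassifier().fit(X_tr[:, mask], Y_tr)
    return f1_score(Y_te, clf.predict(X_te[:, mask]), average="micro")

parent = rng.random(30) > 0.5
for gen in range(10):
    offspring = parent ^ (rng.random((50, 30)) < 0.1)   # 50 mutants per generation
    scores = [score(m) for m in offspring]
    best = offspring[int(np.argmax(scores))]
    if score(best) >= score(parent):
        parent = best                                    # retain the most promising

print("selected features:", np.flatnonzero(parent))
```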