Title/Summary/Keyword: Feature Variables

361 search results

Association-based Unsupervised Feature Selection for High-dimensional Categorical Data (고차원 범주형 자료를 위한 비지도 연관성 기반 범주형 변수 선택 방법)

  • Lee, Changki;Jung, Uk
    • Journal of Korean Society for Quality Management, v.47 no.3, pp.537-552, 2019
  • Purpose: The development of information technology makes it easy to utilize high-dimensional categorical data. In this regard, the purpose of this study is to propose a novel method for selecting the proper categorical variables in high-dimensional categorical data. Methods: The proposed feature selection method consists of three steps: (1) The first step defines the goodness-to-pick measure. In this paper, a categorical variable is relevant if it has relationships with other variables. According to this definition of relevant variables, the goodness-to-pick measure calculates the normalized conditional entropy with the other variables. (2) The second step finds the relevant feature subset from the original variable set by deciding whether each variable is relevant or not. (3) The third step eliminates redundant variables from the relevant feature subset. Results: Our experimental results showed that the proposed feature selection method generally yielded better classification performance than using no feature selection in high-dimensional categorical data, especially as the number of irrelevant categorical variables increases. Moreover, as the number of irrelevant categorical variables with imbalanced categorical values increases, the accuracy gap between the proposed method and the existing methods being compared widens. Conclusion: The experimental results confirmed that the proposed method consistently produces high classification accuracy in high-dimensional categorical data. Therefore, the proposed method is promising for effective use in high-dimensional settings.
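The normalized-conditional-entropy measure in step (1) can be sketched in a few lines. This is a minimal illustration, not the authors' code; the toy data and the averaging used in `goodness_to_pick` are assumptions:

```python
from collections import Counter
from math import log2

def entropy(values):
    # Shannon entropy of a categorical sample
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def normalized_cond_entropy(xs, ys):
    """H(X | Y) / H(X): 0 when Y fully determines X, 1 when Y tells nothing."""
    hx = entropy(xs)
    if hx == 0:
        return 0.0
    n = len(xs)
    h = sum((cy / n) * entropy([x for x, y in zip(xs, ys) if y == yv])
            for yv, cy in Counter(ys).items())
    return h / hx

def goodness_to_pick(data, j):
    # illustrative score: a variable is 'good to pick' if, on average,
    # the other variables reduce its uncertainty
    others = [k for k in range(len(data)) if k != j]
    return 1.0 - sum(normalized_cond_entropy(data[j], data[k])
                     for k in others) / len(others)
```

A variable perfectly predicted by another scores high; one unrelated to everything scores near zero and would be dropped in step (2).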

Use of Artificial Bee Swarm Optimization (ABSO) for Feature Selection in System Diagnosis for Coronary Heart Disease

  • Wiharto;Yaumi A. Z. A. Fajri;Esti Suryani;Sigit Setyawan
    • Journal of information and communication convergence engineering, v.21 no.2, pp.130-138, 2023
  • Selecting the correct examination variables for diagnosing heart disease provides many benefits, including faster diagnosis and lower examination cost. The selection of examination variables can be based on data from previous examination results, so that future investigations need only collect the selected variables. This paper proposes a model for selecting examination variables using the Artificial Bee Swarm Optimization method, considering both the accuracy and the cost of inspection. The proposed feature selection model was evaluated using accuracy, area under the curve (AUC), the number of variables, and inspection cost as performance parameters. The test results show that the proposed model can produce 24 examination variables while providing 95.16% accuracy and 97.61% AUC. These results indicate a significant decrease in the number of inspection variables and in inspection costs while keeping performance in the excellent category.
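A heavily simplified sketch of a binary bee-swarm search of this kind is shown below; the single-bit neighbourhood move, the swarm update, and the value-minus-cost fitness are all illustrative stand-ins for the paper's ABSO formulation:

```python
import random

def abso_select(n_features, fitness, n_bees=10, n_iters=30, seed=0):
    """Very simplified binary bee-swarm feature selection sketch.

    fitness maps a tuple of 0/1 selection flags to a score to maximize
    (e.g. diagnostic accuracy minus inspection cost).
    """
    rng = random.Random(seed)
    # food sources = candidate feature subsets
    foods = [tuple(rng.randint(0, 1) for _ in range(n_features))
             for _ in range(n_bees)]
    best = max(foods, key=fitness)
    for _ in range(n_iters):
        new_foods = []
        for food in foods:
            # employed bee: try flipping one random selection flag
            i = rng.randrange(n_features)
            cand = food[:i] + (1 - food[i],) + food[i + 1:]
            new_foods.append(cand if fitness(cand) > fitness(food) else food)
        foods = new_foods
        foods.sort(key=fitness, reverse=True)
        foods[-1] = best  # keep the best-so-far source in the swarm (simplified)
        if fitness(foods[0]) > fitness(best):
            best = foods[0]
    return best

# hypothetical per-test diagnostic values; each selected test costs 1 unit
weights = [5, 0, 3, 0, 0]
def fitness(mask):
    return sum(w * m for w, m in zip(weights, mask)) - sum(mask)

best = abso_select(5, fitness)
```

With this toy fitness the search settles on the two informative tests and drops the three that only add cost.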

Set Covering-based Feature Selection of Large-scale Omics Data (Set Covering 기반의 대용량 오믹스데이터 특징변수 추출기법)

  • Ma, Zhengyu;Yan, Kedong;Kim, Kwangsoo;Ryoo, Hong Seo
    • Journal of the Korean Operations Research and Management Science Society, v.39 no.4, pp.75-84, 2014
  • In this paper, we deal with the feature selection problem for large-scale, high-dimensional biological data such as omics data. Most previous approaches use a simple score function to reduce the number of original variables and then select features from the small number of remaining variables. Methods that do not rely on such filtering either ignore the interactions between variables or generate approximate solutions to a simplified problem. Unlike them, by combining set covering and clustering techniques, we developed a new method that can handle the total set of variables and consider the combinatorial effects of variables when selecting good features. To demonstrate the efficacy and effectiveness of the method, we downloaded gene expression datasets from TCGA (The Cancer Genome Atlas) and compared our method with other algorithms, including the feature selection algorithms embedded in WEKA. The experimental results show that our method selects high-quality features that yield more accurate classifiers than the other feature selection algorithms.
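The set-covering view of feature selection can be illustrated with a greedy sketch: each feature "covers" the cross-class sample pairs it separates, and features are added until every pair is covered. This is a simplified stand-in for the paper's exact formulation (which also uses clustering):

```python
from itertools import combinations

def set_cover_select(X, y):
    """Greedy set-covering feature selection sketch.

    X: list of sample feature vectors, y: class labels.
    A feature covers a cross-class pair (i, j) if it takes
    different values on samples i and j.
    """
    pairs = {(i, j) for i, j in combinations(range(len(y)), 2) if y[i] != y[j]}
    selected, uncovered = [], set(pairs)
    n_feat = len(X[0])
    while uncovered:
        best, best_cov = None, set()
        for f in range(n_feat):
            if f in selected:
                continue
            cov = {(i, j) for (i, j) in uncovered if X[i][f] != X[j][f]}
            if len(cov) > len(best_cov):
                best, best_cov = f, cov
        if best is None:
            break  # remaining pairs cannot be separated by any feature
        selected.append(best)
        uncovered -= best_cov
    return selected
```

Because coverage is evaluated over pairs rather than one feature at a time, combinatorial effects between features are taken into account, which is the key difference from simple score-function filters.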

Feature selection for text data via sparse principal component analysis (희소주성분분석을 이용한 텍스트데이터의 단어선택)

  • Won Son
    • The Korean Journal of Applied Statistics, v.36 no.6, pp.501-514, 2023
  • When analyzing high-dimensional data such as text data, inputting all the variables as explanatory variables may cause statistical learning procedures to suffer from over-fitting. Furthermore, computational efficiency can deteriorate with a large number of variables. Dimensionality reduction techniques such as feature selection or feature extraction are useful for dealing with these problems. Sparse principal component analysis (SPCA) is a regularized least squares method that employs an elastic-net-type objective function. The SPCA can be used to remove insignificant principal components and to identify important variables from noisy observations. In this study, we propose a dimension reduction procedure for text data based on the SPCA. Applying the proposed procedure to real data, we find that the reduced feature set maintains sufficient information in the text data while its size is reduced by removing redundant variables. As a result, the proposed procedure can improve classification accuracy and computational efficiency, especially for classifiers such as the k-nearest neighbors algorithm.
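A rough sketch of SPCA-style word selection is below; for brevity it uses soft-thresholded power iteration on the covariance matrix rather than the full elastic-net SPCA formulation, so the `alpha` threshold and toy data are assumptions:

```python
import numpy as np

def sparse_pc1(X, alpha=0.1, n_iter=100):
    """First sparse principal component via soft-thresholded power iteration."""
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / len(Xc)                 # sample covariance
    v = np.ones(C.shape[1]) / np.sqrt(C.shape[1])
    for _ in range(n_iter):
        u = C @ v
        u = np.sign(u) * np.maximum(np.abs(u) - alpha, 0.0)  # soft threshold
        norm = np.linalg.norm(u)
        if norm == 0:
            break
        v = u / norm
    return v

def select_words(X, vocab, alpha=0.1):
    # words whose loadings are shrunk exactly to zero drop out of the feature set
    v = sparse_pc1(np.asarray(X, dtype=float), alpha)
    return [w for w, load in zip(vocab, v) if abs(load) > 1e-8]
```

Unlike ordinary PCA, whose loadings are dense, the thresholding produces exact zeros, so the surviving coordinates directly name the selected words.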

A study on the prediction of optimized injection molding conditions and the feature selection using the Artificial Neural Network(ANN) (인공신경망을 통한 사출 성형조건의 최적화 예측 및 특성 선택에 관한 연구)

  • Yang, Dong-Cheol;Kim, Jong-Sun
    • Design & Manufacturing, v.16 no.3, pp.50-57, 2022
  • The quality of products produced by injection molding is strongly influenced by the process variables of the injection molding machine set by the engineer. Because the processing conditions have a complex impact on product quality, and given the stochastic nature of the manufacturing process, it is very difficult to predict the quality of an injection molded product. The artificial neural network (ANN) is recognized as being capable of mapping the intricate relationship between input and output variables very accurately; therefore, many studies attempt to predict the relationship between product results and process variables using ANNs. However, with a small number of data sets, too many input variables can reduce the predictive performance and robustness of an ANN model. In the present study, an ANN model was developed that predicts the length of the injection molded product for multiple combinations of process variables, and its accuracy was compared between using all 8 process variables and using the 4 important process inputs determined by feature selection. Based on the comparison, it was verified that the performance of the ANN model increased when only the 4 important variables were applied.
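The select-then-compare protocol can be sketched as follows; for a self-contained example, a correlation filter and a least-squares linear model stand in for the paper's feature selection and ANN, and the synthetic process data is an assumption:

```python
import numpy as np

def top_k_features(X, y, k):
    # rank inputs by absolute correlation with the output
    # (a simple filter standing in for the paper's feature selection)
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(np.argsort(scores)[::-1][:k])

def fit_predict(X_train, y_train, X_test):
    # least-squares linear model as a small stand-in for the ANN
    A = np.c_[X_train, np.ones(len(X_train))]
    w, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.c_[X_test, np.ones(len(X_test))] @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                 # 8 hypothetical process variables
y = (3 * X[:, 0] + 2 * X[:, 1] + 1.5 * X[:, 2] + X[:, 3]
     + 0.01 * rng.normal(size=200))           # length depends on only 4 of them
selected = top_k_features(X, y, 4)
```

The same comparison as in the paper then amounts to training one model on all 8 columns and one on `X[:, selected]` and comparing held-out errors.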

The Credit Information Feature Selection Method in Default Rate Prediction Model for Individual Businesses (개인사업자 부도율 예측 모델에서 신용정보 특성 선택 방법)

  • Hong, Dongsuk;Baek, Hanjong;Shin, Hyunjoon
    • Journal of the Korea Society for Simulation, v.30 no.1, pp.75-85, 2021
  • In this paper, we present a deep neural network-based prediction model that processes and analyzes the corporate and personal credit information of individual business owners as a new method to predict the default rate of individual businesses more accurately. In modeling research across various fields, feature selection techniques have been actively studied as a way to improve performance, especially in predictive models with many features. In this paper, after statistical verification of the macroeconomic indicators (macro variables) and credit information (micro variables) used as inputs to the default rate prediction model, the final feature set that improves prediction performance was identified through the proposed credit information feature selection method. This method is an iterative, hybrid procedure combining filter-based and wrapper-based approaches: it builds submodels, constructs subsets by extracting the important variables of the best-performing submodels, and determines the final feature set through prediction performance analysis of the subsets and their combined set.
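The filter-plus-wrapper idea can be sketched generically; the scoring functions below are placeholders, not the paper's submodel construction:

```python
def hybrid_select(features, filter_score, wrapper_score, k_filter=10):
    """Sketch of an iterative filter + wrapper feature selection.

    Step 1 (filter): keep the k_filter features with the highest
    univariate score. Step 2 (wrapper): greedy forward selection,
    adding a feature only while the subset's model score improves.
    """
    ranked = sorted(features, key=filter_score, reverse=True)[:k_filter]
    selected, best = [], wrapper_score(())
    improved = True
    while improved:
        improved = False
        for f in ranked:
            if f in selected:
                continue
            cand = tuple(selected + [f])
            score = wrapper_score(cand)
            if score > best:
                selected, best = list(cand), score
                improved = True
    return selected, best
```

In the paper's setting, `wrapper_score` would train a deep-network submodel on the candidate subset and return its validation performance; the cheap filter step keeps the number of expensive wrapper evaluations manageable.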

A Preliminary Study on Correlation between Voice Characteristics and Speech Features (목소리 특성의 주관적 평가와 음성 특징과의 상관관계 기초연구)

  • Han, Sung-Man;Kim, Sang-Beom;Kim, Jong-Yeol;Kwon, Chul-Hong
    • Phonetics and Speech Sciences, v.3 no.4, pp.85-91, 2011
  • Sasang constitution medicine utilizes voice characteristics to diagnose a person's constitution. To classify Sasang constitutional groups using speech information technology, this study aims to establish the relationship between Sasang constitutional groups and their corresponding voice characteristics by investigating various speech feature variables, including features related to the speech source and the vocal tract filter. Experimental results show statistically significant correlations between voice characteristics and some of the speech feature variables.
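Such correlations can be computed with a plain Pearson coefficient; the "clarity" ratings and jitter-like values below are hypothetical, not data from the study:

```python
import numpy as np

def pearson_r(x, y):
    # Pearson correlation between two equal-length samples
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    return float((xm @ ym) / np.sqrt((xm @ xm) * (ym @ ym)))

# hypothetical data: subjective 'clarity' ratings vs. a jitter-like measure
ratings = [4.5, 4.0, 3.2, 2.8, 2.0]
jitter = [0.3, 0.5, 0.9, 1.1, 1.6]
r = pearson_r(ratings, jitter)
```

A strongly negative `r` here would mirror the kind of rating-vs-feature relationship the study tests for significance.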


Analyzing empirical performance of correlation based feature selection with company credit rank score dataset - Emphasis on KOSPI manufacturing companies -

  • Nam, Youn Chang;Lee, Kun Chang
    • Journal of the Korea Society of Computer and Information, v.21 no.4, pp.63-71, 2016
  • This paper applies an efficient data mining method that improves the score calculation and construction of a credit ranking score system. The main idea is to apply correlation-based feature selection, which can also be used to quickly verify the soundness of existing rank scores. This study selected 2,047 manufacturing companies listed on the KOSPI market during the period 2009 to 2013, each with credit rank scores assigned by the NICE information service agency. For the relevant financial variables, a total of 80 variables were collected from KIS-Value and DART (Data Analysis, Retrieval and Transfer System). If correlation-based feature selection can identify the more important variables, the required information and cost can be reduced significantly. Through this analysis, the study shows that the proposed correlation-based feature selection method improves the selection and classification process of the credit rank system, increasing accuracy and credibility while decreasing the cost of building the system.
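Correlation-based feature selection scores a subset by its merit, k·r̄cf / √(k + k(k−1)·r̄ff), rewarding correlation with the class while penalizing redundancy between features. A minimal sketch with toy data (not the study's 80-variable dataset):

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit: k * r_cf / sqrt(k + k*(k-1) * r_ff).

    r_cf: mean |feature-class correlation| over the subset,
    r_ff: mean |feature-feature correlation| over subset pairs.
    """
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    if k == 1:
        return float(r_cf)
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return float(k * r_cf / np.sqrt(k + k * (k - 1) * r_ff))
```

Adding a near-duplicate of an already-selected variable raises r̄ff and lowers the merit, which is why CFS naturally discards redundant financial variables.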

A Comparative Study on the Methodology of Failure Detection of Reefer Containers Using PCA and Feature Importance (PCA 및 변수 중요도를 활용한 냉동컨테이너 고장 탐지 방법론 비교 연구)

  • Lee, Seunghyun;Park, Sungho;Lee, Seungjae;Lee, Huiwon;Yu, Sungyeol;Lee, Kangbae
    • Journal of the Korea Convergence Society, v.13 no.3, pp.23-31, 2022
  • This study analyzed actual reefer (refrigerated) container operation data from Starcool provided by H Shipping. Through interviews with H's field experts, only Critical and Fatal alarms among the four failure alarms were defined as failures, and it was confirmed that using all variables, given the nature of reefer containers, resulted in cost inefficiency. Therefore, this study proposes a method for detecting reefer container failures using feature importance and PCA techniques. To improve model performance, we select variables based on feature importance from tree-based models such as XGBoost and LGBoost, and use PCA to reduce the dimensionality of the full variable set for each model. With the boosting-based XGBoost and LGBoost techniques, the model proposed in this study improved recall by 0.36 and 0.39, respectively, compared to supervised learning using all 62 variables.
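The PCA side of this pipeline can be sketched in plain NumPy (the tree-based importance ranking, which requires an XGBoost/LGBoost model, is omitted); the 95% variance threshold below is an illustrative choice, not the study's setting:

```python
import numpy as np

def pca_reduce(X, var_ratio=0.95):
    """Project X onto the fewest principal components explaining var_ratio."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data gives the principal directions in Vt
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)
    k = int(np.searchsorted(np.cumsum(explained), var_ratio)) + 1
    return Xc @ Vt[:k].T, k
```

Because many of a reefer container's 62 sensor channels move together, a handful of components typically retains almost all the variance, which is what makes the dimension reduction cost-effective.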

A Study on the Extraction of Feature Variables for the Pattern Recognition of Welding Flaws (용접결함의 형상인식을 위한 특징변수 추출에 관한 연구)

  • Kim, Jae-Yeol;Roh, Byung-Ok;You, Sin;Kim, Chang-Hyun;Ko, Myung-Soo
    • Journal of the Korean Society for Precision Engineering, v.19 no.11, pp.103-111, 2002
  • In this study, natural flaws in welded parts are classified using a signal pattern classification method. A digital storage oscilloscope with an FFT function and an enveloped-waveform generator is used, and the signal pattern recognition procedure consists of digital signal processing, feature extraction, feature selection, and classifier design. The classifiers considered are a distance classifier based on Euclidean distance and an empirical Bayesian classifier. Feature extraction is performed using the class-mean scatter criterion. The signal pattern classification method is applied to the pattern recognition of natural flaws.
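A minimum-distance (nearest-class-mean Euclidean) classifier of the kind mentioned can be sketched as follows; the flaw labels and feature vectors are hypothetical:

```python
import numpy as np

def fit_class_means(X, y):
    # estimate one mean feature vector per flaw class
    classes = sorted(set(y))
    means = [np.mean([x for x, c in zip(X, y) if c == cls], axis=0)
             for cls in classes]
    return classes, means

def predict(x, classes, means):
    # assign the flaw to the class whose mean feature vector is nearest
    d = [np.linalg.norm(np.asarray(x, dtype=float) - m) for m in means]
    return classes[int(np.argmin(d))]
```

The empirical Bayesian classifier mentioned alongside it would replace the raw Euclidean distance with class-conditional likelihoods estimated from the training signals.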