• Title/Abstract/Keywords: Feature Variables

Search results: 363 items (processing time 0.032 s)

Association-based Unsupervised Feature Selection for High-dimensional Categorical Data

  • 이창기;정욱
    • 품질경영학회지 / Vol. 47, No. 3 / pp. 537-552 / 2019
  • Purpose: The development of information technology makes it easy to utilize high-dimensional categorical data. The purpose of this study is to propose a novel method for selecting the proper categorical variables in high-dimensional categorical data. Methods: The proposed feature selection method consists of three steps. (1) The first step defines the goodness-to-pick measure. In this paper, a categorical variable is relevant if it is related to other variables. Under this definition, the goodness-to-pick measure calculates the normalized conditional entropy with respect to the other variables. (2) The second step finds the relevant feature subset in the original variable set by deciding whether each variable is relevant. (3) The third step eliminates redundant variables from the relevant feature subset. Results: Our experimental results showed that the proposed feature selection method generally yielded better classification performance than using no feature selection in high-dimensional categorical data, especially as the number of irrelevant categorical variables increases. In addition, as the number of irrelevant categorical variables with imbalanced categorical values increases, the difference in accuracy between the proposed method and the existing methods it was compared with also increases. Conclusion: The experimental results confirm that the proposed method consistently produces high classification accuracy in high-dimensional categorical data. The proposed method is therefore promising for use in high-dimensional settings.
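
The abstract above does not give the exact form of the goodness-to-pick measure, so the following is only a rough sketch of the general idea of scoring a categorical variable by its normalized conditional entropy against the other variables. The `relevance` function, its averaging rule, and the toy data are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X | Y), computed as H(X, Y) - H(Y)."""
    return entropy(list(zip(x, y))) - entropy(y)


def relevance(data, name):
    """Hypothetical score: mean of 1 - H(X|Y)/H(X) over every other variable Y.

    Values near 1 mean X is well explained by the other variables,
    values near 0 mean X is unrelated to them.
    """
    x = data[name]
    hx = entropy(x)
    if hx == 0:  # a constant column carries no information
        return 0.0
    others = [k for k in data if k != name]
    return sum(1 - conditional_entropy(x, data[k]) / hx for k in others) / len(others)


# Toy data (made up): "a" and "b" move together, "noise" is independent of both.
data = {
    "a": ["x", "x", "y", "y", "x", "x", "y", "y"],
    "b": ["p", "p", "q", "q", "p", "p", "q", "q"],
    "noise": ["u", "v", "u", "v", "u", "v", "u", "v"],
}
scores = {name: relevance(data, name) for name in data}
```

In this toy setting the two associated variables score higher than the independent one, which is the behaviour a relevance threshold in step (2) would rely on.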

Use of Artificial Bee Swarm Optimization (ABSO) for Feature Selection in System Diagnosis for Coronary Heart Disease

  • Wiharto;Yaumi A. Z. A. Fajri;Esti Suryani;Sigit Setyawan
    • Journal of information and communication convergence engineering / Vol. 21, No. 2 / pp. 130-138 / 2023
  • The selection of the correct examination variables for diagnosing heart disease provides many benefits, including faster diagnosis and lower cost of examination. The selection of inspection variables can be performed by referring to the data of previous examination results so that future investigations can be carried out by referring to these selected variables. This paper proposes a model for selecting examination variables using an Artificial Bee Swarm Optimization method by considering the variables of accuracy and cost of inspection. The proposed feature selection model was evaluated using the performance parameters of accuracy, area under curve (AUC), number of variables, and inspection cost. The test results show that the proposed model can produce 24 examination variables and provide 95.16% accuracy and 97.61% AUC. These results indicate a significant decrease in the number of inspection variables and inspection costs while maintaining performance in the excellent category.
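
The abstract says the search weighs accuracy against inspection cost but does not state the objective function. As a minimal sketch, assuming a simple weighted trade-off, a candidate subset of examination variables could be scored as below. The `fitness` function, the `cost_weight` value, and the example costs and accuracies are all hypothetical.

```python
def fitness(mask, accuracy, costs, cost_weight=0.3):
    """Score a feature subset: reward accuracy, penalize relative examination cost.

    mask        -- list of 0/1 flags, one per examination variable
    accuracy    -- accuracy (0..1) of a classifier trained on the selected variables
    costs       -- cost of each examination variable, in the same order as mask
    cost_weight -- hypothetical trade-off weight, not taken from the paper
    """
    selected_cost = sum(c for m, c in zip(mask, costs) if m)
    relative_cost = selected_cost / sum(costs)
    return (1 - cost_weight) * accuracy - cost_weight * relative_cost


# Made-up example: dropping one expensive test costs a little accuracy
# but lowers the total examination cost substantially.
costs = [1.0, 5.0, 20.0, 4.0]
cheap = fitness([1, 1, 0, 1], accuracy=0.90, costs=costs)
full = fitness([1, 1, 1, 1], accuracy=0.92, costs=costs)
```

A swarm-based optimizer would evaluate many such masks and keep the ones with the highest score; the search procedure itself is not sketched here.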

Set Covering-based Feature Selection of Large-scale Omics Data

  • 마정우;안기동;김광수;류홍서
    • 한국경영과학회지 / Vol. 39, No. 4 / pp. 75-84 / 2014
  • In this paper, we deal with the feature selection problem for large-scale, high-dimensional biological data such as omics data. For this problem, most previous approaches used a simple score function to reduce the number of original variables and then selected features from the small number of remaining variables. Methods that do not rely on filtering techniques either do not consider the interactions between variables or generate approximate solutions to a simplified problem. In contrast, by combining set covering and clustering techniques, we developed a new method that can handle the full set of variables and consider the combinatorial effects of variables when selecting good features. To demonstrate the efficacy and effectiveness of the method, we downloaded gene expression datasets from TCGA (The Cancer Genome Atlas) and compared our method with other algorithms, including the feature selection algorithms embedded in WEKA. The experimental results showed that our method selects high-quality features for constructing more accurate classifiers than other feature selection algorithms.
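
The abstract does not describe how the set covering model is built, so the sketch below shows one common way to phrase feature selection as set covering, under the assumption of binary features and two classes: each feature "covers" the pairs of opposite-class samples on which it differs, and a greedy heuristic picks features until every pair is covered. The function name and the toy data are illustrative only, and the clustering step mentioned in the abstract is not included.

```python
from itertools import product


def greedy_set_cover_features(pos, neg):
    """Pick features so every (positive, negative) sample pair differs on a chosen feature.

    pos, neg -- lists of equal-length 0/1 feature vectors for the two classes.
    Returns the selected feature indices in the order they were chosen.
    """
    n_features = len(pos[0])
    uncovered = set(product(range(len(pos)), range(len(neg))))
    # covers[j] holds the sample pairs that feature j separates
    covers = [
        {(i, k) for (i, k) in uncovered if pos[i][j] != neg[k][j]}
        for j in range(n_features)
    ]
    selected = []
    while uncovered:
        # Greedy step: take the feature that separates the most uncovered pairs.
        best = max(range(n_features), key=lambda j: len(covers[j] & uncovered))
        gained = covers[best] & uncovered
        if not gained:  # the remaining pairs are identical on every feature
            break
        selected.append(best)
        uncovered -= gained
    return selected


# Toy data (made up): no single feature separates all four pairs.
pos = [[1, 1, 0], [1, 0, 0]]
neg = [[0, 1, 0], [1, 0, 1]]
selected = greedy_set_cover_features(pos, neg)
```

The greedy rule is a standard approximation for set covering and is used here only because it is short; an exact formulation would solve the corresponding integer program instead.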

Feature selection for text data via sparse principal component analysis

  • 손원
    • 응용통계연구 / Vol. 36, No. 6 / pp. 501-514 / 2023
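  • Text data generally consist of many words. For data with many variables, such as text data, overfitting and related problems often reduce the accuracy of the analysis and make computation inefficient. Dimension reduction techniques such as feature selection and feature extraction are frequently used to analyze such data. Sparse principal component analysis is a penalized least squares method that uses an elastic-net-type objective function to remove principal components that are not useful and to identify only the variables of high importance within each principal component. This study proposes a procedure that uses sparse principal component analysis to summarize text data with many variables using only a small number of variables. Applying the procedure to real data confirmed that selecting words through sparse principal component analysis removes words of low usefulness without using information about the target variable, reducing the dimension of the data while maintaining the classification accuracy of the text data. In particular, dimension reduction was found to improve the performance of classifiers such as KNN, whose classification accuracy degrades in high-dimensional data analysis.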

A study on the prediction of optimized injection molding conditions and the feature selection using the Artificial Neural Network (ANN)

  • 양동철;김종선
    • Design & Manufacturing / Vol. 16, No. 3 / pp. 50-57 / 2022
  • The quality of products produced by injection molding is strongly influenced by the process variables of the injection molding machine set by the engineer. Because the processing conditions have a complex effect on product quality, it is very difficult to predict the quality of an injection molded product given the stochastic nature of the manufacturing process. The artificial neural network (ANN) is recognized as capable of mapping the intricate relationship between input and output variables very accurately, so many studies use ANNs to predict the relationship between product results and process variables. However, when only a small number of data sets is available, the predictive performance and robustness of an ANN model can be reduced by too many input variables. In the present study, an ANN model was developed that predicts the length of the injection molded product for multiple combinations of process variables. The accuracy of the ANN models was then compared between the 8 process variables and the 4 important process inputs determined by feature selection. The comparison verified that the performance of the ANN model increased when only the 4 important variables were applied.

The Credit Information Feature Selection Method in Default Rate Prediction Model for Individual Businesses

  • 홍동숙;백한종;신현준
    • 한국시뮬레이션학회논문지 / Vol. 30, No. 1 / pp. 75-85 / 2021
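  • This paper presents a deep neural network-based prediction model that processes and analyzes the corporate credit and personal credit information of individual business owners and uses it as input features, as a new way to predict the default rate of individual businesses more accurately. In modeling research across many fields, feature selection techniques have been actively studied as a way to improve performance, particularly in prediction models with many features. In this paper, after statistical verification of the input variables used in the default rate prediction model, namely macroeconomic indicators (macro variables) and credit information (micro variables), a credit information feature selection method is additionally applied to identify the feature set that improves prediction performance. The proposed credit information feature selection method is an iterative hybrid method that combines a filter method performing statistical verification with multiple wrappers. It builds sub-models, extracts the important variables of the best-performing model to form subsets, and then determines the final feature set by analyzing the prediction performance of the subsets and their combined set.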

A Preliminary Study on Correlation between Voice Characteristics and Speech Features

  • 한성만;김상범;김종열;권철홍
    • 말소리와 음성과학 / Vol. 3, No. 4 / pp. 85-91 / 2011
  • Sasang constitutional medicine utilizes voice characteristics to diagnose a person's constitution. To classify Sasang constitutional groups using speech information technology, this study aims to establish the relationship between Sasang constitutional groups and their corresponding voice characteristics by investigating various speech feature variables. The speech variables include features related to the speech source and the vocal tract filter. Experimental results show a statistically significant correlation between voice characteristics and some speech feature variables.

Analyzing empirical performance of correlation based feature selection with company credit rank score dataset - Emphasis on KOSPI manufacturing companies -

  • Nam, Youn Chang;Lee, Kun Chang
    • 한국컴퓨터정보학회논문지 / Vol. 21, No. 4 / pp. 63-71 / 2016
  • This paper applies an efficient data mining method that improves the score calculation and the building performance of a credit rank score system. The main idea is to achieve these objectives by applying Correlation based Feature Selection, which can also be used to quickly verify the properness of existing rank scores. This study selected 2047 manufacturing companies on the KOSPI market during the period 2009 to 2013 that have credit rank scores given by the NICE information service agency. As relevant financial variables, a total of 80 variables were collected from KIS-Value and DART (Data Analysis, Retrieval and Transfer System). If correlation based feature selection can select the more important variables, the required information and cost would be reduced significantly. Through analysis, this study shows that the proposed correlation based feature selection method improves the selection and classification process of the credit rank system, so that accuracy and credibility increase while the cost of building the system decreases.
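
For readers unfamiliar with correlation based feature selection, the sketch below computes the subset merit usually associated with the technique, k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean feature-to-target correlation and r_ff the mean feature-to-feature correlation. It assumes numeric features and Pearson correlation; the abstract does not say which correlation measure or search strategy the study used, and the feature names and values here are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cfs_merit(features, target, subset):
    """Merit of a feature subset: high target correlation, low mutual correlation."""
    k = len(subset)
    r_cf = sum(abs(pearson(features[f], target)) for f in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = (
        sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
        if pairs
        else 0.0
    )
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Made-up data: "f1_copy" duplicates "f1", so it adds no merit to the subset.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0, 7.0],
    "f1_copy": [1.0, 2.0, 3.0, 4.0, 5.0, 7.0],
}
single = cfs_merit(features, target, ["f1"])
with_copy = cfs_merit(features, target, ["f1", "f1_copy"])
```

Because a perfectly redundant feature leaves the merit unchanged, a search that maximizes this score has no incentive to keep duplicated financial variables, which is how the method can reduce the information that has to be collected.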

A Comparative Study on the Methodology of Failure Detection of Reefer Containers Using PCA and Feature Importance

  • 이승현;박성호;이승재;이희원;유성열;이강배
    • 한국융합학회논문지 / Vol. 13, No. 3 / pp. 23-31 / 2022
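  • This study analyzed actual reefer container operation data from Starcool, provided by shipping company H. Through interviews with field experts at company H, only the Critical and Fatal Alarms among the four failure alarm types were defined as failures, and it was confirmed that, given the nature of reefer containers, using all variables is inefficient in terms of cost. This study therefore presents a reefer container failure detection method based on feature importance and PCA. To improve model performance, feature selection is performed based on the feature importance of tree-based models such as XGBoost and LGBoost, the variables that are not selected are passed through PCA to reduce the dimension of the overall variable set, and supervised learning is carried out for each model. For the boosting-based XGBoost and LGBoost methods, the model proposed in this study improved recall by 0.36 and 0.39, respectively, compared with supervised learning using all 62 variables.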

A Study on the Extraction of Feature Variables for the Pattern Recognition of Welding Flaws

  • 김재열;노병옥;유신;김창현;고명수
    • 한국정밀공학회지 / Vol. 19, No. 11 / pp. 103-111 / 2002
  • In this study, natural flaws in welded parts are classified using the signal pattern classification method. A digital storage oscilloscope with an FFT function and an enveloped waveform generator is used, and the signal pattern recognition procedure consists of digital signal processing, feature extraction, feature selection, and classifier design. The classifiers considered and discussed are a distance classifier based on Euclidean distance and an empirical Bayesian classifier. Feature extraction is performed using the class-mean scatter criteria. The signal pattern classification method is applied to the signal pattern recognition of natural flaws.
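
As an illustration of what a Euclidean distance classifier does, the following minimal sketch assigns a feature vector to the class whose mean vector is nearest. It assumes the common nearest-class-mean form of the classifier, which the abstract does not spell out; the flaw class names and feature values are invented for the example.

```python
from math import dist  # available in Python 3.8+


def fit_class_means(samples, labels):
    """Return the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(x, means):
    """Assign x to the class whose mean is closest in Euclidean distance."""
    return min(means, key=lambda label: dist(x, means[label]))


# Made-up two-dimensional signal features for two hypothetical flaw types.
samples = [[1.0, 1.2], [0.8, 1.0], [3.0, 3.1], [3.2, 2.9]]
labels = ["crack", "crack", "porosity", "porosity"]
means = fit_class_means(samples, labels)
predicted = classify([1.1, 0.9], means)
```

Features with well separated class means make this rule reliable, which is why a separability criterion is a natural companion to it during feature extraction.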