• Title/Summary/Keyword: Feature statistics


Tree-structured Classification based on Variable Splitting

  • Ahn, Sung-Jin
    • Communications for Statistical Applications and Methods
    • /
    • v.2 no.1
    • /
    • pp.74-88
    • /
    • 1995
  • This article introduces a unified method for choosing the most explanatory and significant multiway partitions in classification tree design and analysis. The method is derived from the impurity reduction (IR) measure of divergence, which is proposed to extend the proportional-reduction-in-error (PRE) measure in the decision-theoretic context. To derive the method, the IR measure is analyzed to characterize the statistical properties used to handle, in a consistent way, the feature formation, feature selection, and feature deletion required in classification tree construction. A numerical example illustrates the proposed approach.
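The abstract does not spell out the IR measure's exact form, but the general idea of scoring a multiway split by how much it reduces impurity can be sketched with the Gini index as a stand-in impurity function (the function names and the choice of Gini are illustrative assumptions, not the article's definitions):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_reduction(parent, partitions):
    """Impurity reduction of a multiway split: parent impurity minus
    the size-weighted impurity of the child partitions."""
    n = len(parent)
    weighted = sum(len(p) / n * gini(p) for p in partitions)
    return gini(parent) - weighted

# A perfectly separating 3-way split removes all impurity,
# so its reduction equals the parent's impurity.
parent = ["a"] * 4 + ["b"] * 4 + ["c"] * 4
split = [["a"] * 4, ["b"] * 4, ["c"] * 4]
```

A split that leaves the class mix unchanged scores zero, which is what makes the reduction usable for comparing candidate partitions.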


Truck Weight Estimation using Operational Statistics at 3rd Party Logistics Environment (운영 데이터를 활용한 제3자 물류 환경에서의 배송 트럭 무게 예측)

  • Yu-jin Lee;Kyung Min Choi;Song-eun Kim;Kyungsu Park;Seung Hwan Jung
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.45 no.4
    • /
    • pp.127-133
    • /
    • 2022
  • Many manufacturers that use third-party logistics (3PL) face challenges in increasing their logistics efficiency. This study introduces an effort to estimate the weight of the delivery trucks provided by 3PL providers, which allows the manufacturer to package and load products into trailers in advance and thereby reduce delivery time. Accurate weight estimation is all the more important because of the total weight regulation. This study uses not only the company's own data but also general prediction variables such as weather, oil prices, and the population of destinations. In addition, operational statistics variables are developed to indicate the availability of trucks in a specific weight category for each 3PL provider. The prediction model, which uses an XGBoost regressor and the permutation feature importance method, achieves highly acceptable performance with a MAPE of 2.785% and shows the effectiveness of the developed operational statistics variables.
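The two evaluation ingredients named in this abstract, MAPE and permutation feature importance, are easy to make concrete. The sketch below uses a toy linear "model" in place of a fitted XGBoost regressor (the data, feature layout, and function names are illustrative assumptions); a feature's importance is the average increase in MAPE when that feature's column is shuffled:

```python
import random

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, col, n_repeats=10, seed=0):
    """Average increase in MAPE when one feature column is shuffled."""
    rng = random.Random(seed)
    base = mape(y, [predict(row) for row in X])
    incs = []
    for _ in range(n_repeats):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        Xp = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        incs.append(mape(y, [predict(row) for row in Xp]) - base)
    return sum(incs) / len(incs)

# Toy "truck weight" data: the target depends only on the first feature,
# so shuffling it should hurt, while shuffling the second should not.
X = [[i, (i * 7) % 5] for i in range(1, 21)]
y = [10.0 + 2.0 * row[0] for row in X]
model = lambda row: 10.0 + 2.0 * row[0]  # stands in for a fitted regressor
```

With a real model, `predict` would be the fitted regressor's prediction function; the shuffling logic is unchanged.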

Feature selection and prediction modeling of drug responsiveness in Pharmacogenomics (약물유전체학에서 약물반응 예측모형과 변수선택 방법)

  • Kim, Kyuhwan;Kim, Wonkuk
    • The Korean Journal of Applied Statistics
    • /
    • v.34 no.2
    • /
    • pp.153-166
    • /
    • 2021
  • A main goal of pharmacogenomics studies is to predict an individual's drug responsiveness from high-dimensional genetic variables. Because of the large number of variables, feature selection is required to reduce their number. The selected features are then used to construct a predictive model with machine learning algorithms. In the present study, we applied several hybrid feature selection methods, such as combinations of logistic regression, ReliefF, TuRF, random forest, and LASSO, to a next-generation sequencing data set of 400 epilepsy patients. We then fed the selected features to machine learning methods including random forest, gradient boosting, and support vector machines, as well as a stacking ensemble method. Our results showed that the stacking model with a hybrid feature selection of random forest and ReliefF performs better than the other combinations of approaches. Based on a 5-fold cross-validation partition, the best model had a mean test accuracy of 0.727 and a mean test AUC of 0.761. The stacking models also outperformed single machine learning predictive models when using the same selected features.
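Of the filters named above, the Relief family is the least standard-textbook, so a minimal sketch may help: the basic binary-class Relief update rewards a feature that differs at an instance's nearest miss (opposite class) and penalizes one that differs at its nearest hit (same class). This is a simplified Relief, not the ReliefF/TuRF variants the study used, and the toy data are an assumption:

```python
def relief_weights(X, y, n_features):
    """Basic binary-class Relief: reward features that differ at the
    nearest miss and penalize those that differ at the nearest hit."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    w = [0.0] * n_features
    for i, (xi, yi) in enumerate(zip(X, y)):
        hits = [x for j, (x, yj) in enumerate(zip(X, y)) if j != i and yj == yi]
        misses = [x for x, yj in zip(X, y) if yj != yi]
        nh = min(hits, key=lambda x: dist(x, xi))    # nearest hit
        nm = min(misses, key=lambda x: dist(x, xi))  # nearest miss
        for f in range(n_features):
            w[f] += abs(xi[f] - nm[f]) - abs(xi[f] - nh[f])
    return [v / len(X) for v in w]

# Feature 0 separates the classes; feature 1 is noise.
X = [[0.0, 0.5], [0.1, 0.9], [1.0, 0.4], [0.9, 0.8]]
y = [0, 0, 1, 1]
```

ReliefF extends this by averaging over the k nearest hits and misses and handling multiple classes; TuRF additionally drops the worst-scoring features over repeated passes.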

Properties of chi-square statistic and information gain for feature selection of imbalanced text data (불균형 텍스트 데이터의 변수 선택에 있어서의 카이제곱통계량과 정보이득의 특징)

  • Mun, Hye In;Son, Won
    • The Korean Journal of Applied Statistics
    • /
    • v.35 no.4
    • /
    • pp.469-484
    • /
    • 2022
  • Since a large text corpus contains hundreds of thousands of unique words, text data are a typical example of high-dimensional data. Various feature selection methods have therefore been proposed for dimension reduction. Feature selection can improve prediction accuracy and, by reducing the data size, computational efficiency as well. The chi-square statistic and the information gain are two of the most popular measures for identifying interesting terms in text data. In this paper, we investigate the theoretical properties of the chi-square statistic and the information gain. We show that the two filtering metrics share theoretical properties such as non-negativity and convexity. However, they differ in that the information gain is prone to selecting more negative features than the chi-square statistic in imbalanced text data.
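Both measures compared in this paper are computed from a term's 2x2 term-class contingency table. A minimal sketch of the standard definitions (the variable naming is an assumption; the paper's own notation is not given in the abstract):

```python
import math

def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 term-class contingency table:
    a = term & positive, b = term & negative,
    c = no term & positive, d = no term & negative."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(a, b, c, d):
    """Entropy of the class label minus its entropy given term occurrence."""
    n = a + b + c + d
    h_class = entropy([(a + c) / n, (b + d) / n])
    h_cond = ((a + b) / n) * entropy([a / (a + b), b / (a + b)]) \
           + ((c + d) / n) * entropy([c / (c + d), d / (c + d)])
    return h_class - h_cond
```

Both are zero when term and class are independent and maximal when the term determines the class, which is the shared non-negativity the paper builds on; their disagreement shows up in how they rank terms concentrated in the majority (negative) class.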

Structural Quality Defect Discrimination Enhancement using Vertical Energy-based Wavelet Feature Generation (구조물의 품질 결함 변별력 증대를 위한 수직 에너지 기반의 웨이블릿 Feature 생성)

  • Kim, Joon-Seok;Jung, Uk
    • Journal of Korean Society for Quality Management
    • /
    • v.36 no.2
    • /
    • pp.36-44
    • /
    • 2008
  • In this paper, a novel feature extraction and selection procedure is carried out to improve the ability to discriminate between healthy and damaged structures using vibration signals. Although many feature extraction and selection algorithms have been proposed for vibration signals, most do not consider the discriminating ability of features, since they usually operate in an unsupervised manner. We propose a novel feature extraction and selection algorithm that selects the few wavelet coefficients with the highest class-discriminating capability for damage detection and class visualization. We applied three class separability measures to evaluate the features: the t-test statistic, divergence, and the Bhattacharyya distance. Experiments with vibration signals from a truss structure demonstrate that class separability is significantly enhanced using our proposed algorithm compared to the other two algorithms based on original time-domain features and Fourier-based ones.
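Of the three separability measures listed, the Bhattacharyya distance is the least commonly written out. Assuming each wavelet feature is summarized per class by a univariate Gaussian (a simplifying assumption; the paper may use the multivariate form), the standard closed form is:

```python
import math

def bhattacharyya_gauss(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two univariate Gaussians:
    larger values mean the healthy/damaged feature distributions
    overlap less, i.e. the feature separates the classes better."""
    return (0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
            + 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2))))
```

Ranking wavelet coefficients by this score (or by the t-statistic or divergence) and keeping the top few is the supervised selection step the abstract describes.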

Sign Language Shape Recognition Using SOFM Neural Network (SOFM신경망을 이용한 수화 형상 인식)

  • Kim, Kyoung-Ho;Kim, Jong-Min;Jeong, Jea-Young;Lee, Woong-Ki
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2009.11a
    • /
    • pp.283-284
    • /
    • 2009
  • This paper segments the hand region from hand-shape images captured in a single-camera environment and then recognizes the hand shape using a Self-Organizing Feature Map (SOFM) neural network algorithm, with the aim of implementing a more stable and robust recognition system for sign language recognition.

Speaker Identification Using Higher-Order Statistics In Noisy Environment (고차 통계를 이용한 잡음 환경에서의 화자식별)

  • Shin, Tae-Young;Kim, Gi-Sung;Kwon, Young-Uk;Kim, Hyung-Soon
    • The Journal of the Acoustical Society of Korea
    • /
    • v.16 no.6
    • /
    • pp.25-35
    • /
    • 1997
  • Most speech analysis methods developed to date are based on second-order statistics, and one of their biggest drawbacks is the dramatic performance degradation they show in noisy environments. In contrast, methods using higher-order statistics (HOS), which have the property of suppressing Gaussian noise, enable robust feature extraction in noisy environments. In this paper, we propose a text-independent speaker identification system using higher-order statistics and compare its performance with the conventional second-order-statistics-based method in both white and colored noise environments. The proposed speaker identification system is based on the vector quantization approach and employs a HOS-based voiced/unvoiced detector in order to extract feature parameters for voiced speech only, which has a non-Gaussian distribution and is known to contain most speaker-specific characteristics. Experimental results using a 50-speaker database show that the higher-order-statistics-based method gives better identification performance than the conventional second-order-statistics-based method in noisy environments.
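The Gaussian-suppression property mentioned above comes from the fact that all cumulants of a Gaussian above second order vanish, so a fourth-order statistic of signal-plus-Gaussian-noise is, in expectation, the statistic of the signal alone. A small sketch with sample excess kurtosis (the simplest fourth-order cumulant; the test distributions are illustrative assumptions, not the paper's speech data):

```python
import random

def excess_kurtosis(x):
    """Sample excess kurtosis (fourth-order cumulant / variance^2);
    zero in expectation for Gaussian data, which is why HOS-based
    features suppress additive Gaussian noise."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / m2 ** 2 - 3.0

rng = random.Random(1)
# Gaussian samples: excess kurtosis near 0.
gauss = [rng.gauss(0.0, 1.0) for _ in range(20000)]
# Laplace samples (heavier tails, like voiced speech): clearly positive.
laplace = [rng.expovariate(1.0) * rng.choice((-1, 1)) for _ in range(20000)]
```

This is why restricting feature extraction to voiced speech, which is strongly non-Gaussian, plays to the strengths of HOS-based features.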


Feature selection for text data via sparse principal component analysis (희소주성분분석을 이용한 텍스트데이터의 단어선택)

  • Won Son
    • The Korean Journal of Applied Statistics
    • /
    • v.36 no.6
    • /
    • pp.501-514
    • /
    • 2023
  • When analyzing high-dimensional data such as text data, if we input all the variables as explanatory variables, statistical learning procedures may suffer from over-fitting. Furthermore, computational efficiency can deteriorate with a large number of variables. Dimensionality reduction techniques such as feature selection or feature extraction are useful for dealing with these problems. Sparse principal component analysis (SPCA) is a regularized least squares method that employs an elastic-net-type objective function. SPCA can be used to remove insignificant principal components and to identify important variables from noisy observations. In this study, we propose a dimension reduction procedure for text data based on SPCA. Applying the proposed procedure to real data, we find that the reduced feature set retains sufficient information from the text data while its size is reduced by removing redundant variables. As a result, the proposed procedure can improve classification accuracy and computational efficiency, especially for classifiers such as the k-nearest neighbors algorithm.
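The word-selection effect of SPCA comes from penalization driving some loadings to exactly zero, so the surviving words form the reduced feature set. A deliberately simplified sketch, using power iteration with soft-thresholding for the leading sparse component rather than the elastic-net formulation the paper uses (the threshold level, data, and function names are assumptions):

```python
def sparse_pc(X, lam=0.1, n_iter=200):
    """Leading sparse principal component via power iteration with
    soft-thresholding: a simple stand-in for elastic-net SPCA."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    v = [1.0] * p
    for _ in range(n_iter):
        # v <- (1/n) X^T X v, then soft-threshold and renormalize
        scores = [sum(xj * vj for xj, vj in zip(row, v)) for row in Xc]
        v = [sum(s * row[j] for s, row in zip(scores, Xc)) / n for j in range(p)]
        v = [max(abs(vj) - lam, 0.0) * (1 if vj >= 0 else -1) for vj in v]
        norm = sum(vj ** 2 for vj in v) ** 0.5 or 1.0
        v = [vj / norm for vj in v]
    return v

# Two informative, strongly correlated columns and one near-constant
# noise column: the noise column's loading is thresholded to zero.
X = [[i, i + 0.01 * (i % 2), 0.001 * (i % 3)] for i in range(10)]
```

Words whose loadings are exactly zero across the retained components are the "redundant variables" dropped by the procedure.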

Face Recognition Using A New Methodology For Independent Component Analysis (새로운 독립 요소 해석 방법론에 의한 얼굴 인식)

  • 류재흥;고재흥
    • Proceedings of the Korean Institute of Intelligent Systems Conference
    • /
    • 2000.11a
    • /
    • pp.305-309
    • /
    • 2000
  • In this paper, we present a new methodology for face recognition after analyzing the conventional ICA (Independent Component Analysis)-based approach. In the literature, we found that ICA-based methods have followed the same procedure without exception: first, PCA (Principal Component Analysis) is used for feature extraction, and then an ICA learning method is applied for feature enhancement in the reduced dimension. However, it is contradictory that features meant to be extracted using higher-order moments depend on variance, a second-order statistic. These methods do not consider that a necessary component may be located in the discarded feature space. In the new methodology, features are extracted using the magnitude of kurtosis (the fourth-order central moment, or cumulant). This corresponds to PCA-based feature extraction using eigenvalues (the second-order central moment, or variance). The synergy of PCA and ICA can be achieved if PCA is used as a noise reduction filter. The ICA methodology is analyzed using SVD (Singular Value Decomposition): PCA performs whitening and noise reduction, while ICA performs the feature extraction. Simulation results show the effectiveness of the methodology compared to the conventional ICA approach.


A Divisive Clustering for Mixed Feature-Type Symbolic Data (혼합형태 심볼릭 데이터의 군집분석방법)

  • Kim, Jaejik
    • The Korean Journal of Applied Statistics
    • /
    • v.28 no.6
    • /
    • pp.1147-1161
    • /
    • 2015
  • Nowadays we consider and analyze not only classical data, expressed as points in p-dimensional Euclidean space, but also new types of data such as signals, functions, images, and shapes. Symbolic data can also be considered one of these new types. Symbolic data can take various formats, such as intervals, histograms, lists, tables, distributions, and models. To date, symbolic data studies have mainly focused on individual formats of symbolic data. In this study, the scope is extended to datasets containing both histogram-valued and multimodal-valued data; a divisive clustering method for such mixed feature-type symbolic data is introduced and applied to the analysis of industrial accident data.