• Title/Summary/Keyword: Statistics Classification


Recent deep learning methods for tabular data

  • Yejin Hwang;Jongwoo Song
    • Communications for Statistical Applications and Methods / v.30 no.2 / pp.215-226 / 2023
  • Deep learning has made great strides with unstructured data such as text, images, and audio. For tabular data analysis, however, machine learning algorithms such as ensemble methods still outperform deep learning. To keep up with the predictive power of these machine learning algorithms, several deep learning methods for tabular data have been proposed recently. In this paper, we review the latest deep learning models for tabular data and compare their performance on several datasets. In addition, we compare the latest boosting methods to these deep learning methods and offer guidelines to users who analyze tabular datasets. In regression, machine learning methods are better than deep learning methods, but for classification problems, deep learning methods perform better than machine learning methods in some cases.
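
The comparison described above can be illustrated schematically by fitting a boosting model and a small neural network on the same tabular dataset and comparing held-out accuracy. The dataset, models, and hyperparameters below are illustrative assumptions, not the ones used in the paper.

```python
# Hedged sketch: boosting vs. a small neural network on a tabular task.
# Dataset and hyperparameters are arbitrary illustrative choices.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Boosting baseline (tree ensembles usually need no feature scaling).
gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Deep-learning stand-in: a multilayer perceptron on standardized inputs.
scaler = StandardScaler().fit(X_tr)
mlp = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0)
mlp.fit(scaler.transform(X_tr), y_tr)

print("boosting accuracy:", accuracy_score(y_te, gbm.predict(X_te)))
print("MLP accuracy:     ", accuracy_score(y_te, mlp.predict(scaler.transform(X_te))))
```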

On Practical Choice of Smoothing Parameter in Nonparametric Classification (베이즈 리스크를 이용한 커널형 분류에서 평활모수의 선택)

  • Kim, Rae-Sang;Kang, Kee-Hoon
    • Communications for Statistical Applications and Methods / v.15 no.2 / pp.283-292 / 2008
  • The smoothing parameter, or bandwidth, plays a key role in nonparametric classification based on kernel density estimation. We consider choosing the smoothing parameter in nonparametric classification so as to optimize the Bayes risk. Hall and Kang (2005) clarified the theoretical properties of the smoothing parameter in terms of minimizing the Bayes risk and derived its optimal order, using the bootstrap method to explore its numerical properties. We compare the cross-validation and bootstrap methods numerically in terms of the optimal order of the bandwidth, and we also examine their effects on the misclassification rate. We confirm that the bootstrap method is superior to cross-validation in both cases.
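
As a minimal sketch of the bandwidth-selection problem studied here, a kernel-density-based classifier can be tuned by cross-validated misclassification rate (a generic criterion; the authors' bootstrap procedure and Bayes-risk derivations are not reproduced). The data and bandwidth grid are illustrative assumptions.

```python
# Hedged sketch: kernel-density classification with the bandwidth chosen by
# cross-validated misclassification rate (not the bootstrap method compared
# in the paper).
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (200, 1)), rng.normal(2, 1, (200, 1))])
y = np.repeat([0, 1], 200)

def kde_classify(X_tr, y_tr, X_te, h):
    """Assign each test point to the class with the larger estimated density
    (equal class priors, since the classes here are balanced)."""
    log_dens = [KernelDensity(bandwidth=h).fit(X_tr[y_tr == c]).score_samples(X_te)
                for c in np.unique(y_tr)]
    return np.argmax(np.vstack(log_dens), axis=0)

def cv_error(h, n_splits=5):
    errs = []
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=0).split(X, y):
        errs.append(np.mean(kde_classify(X[tr], y[tr], X[te], h) != y[te]))
    return np.mean(errs)

bandwidths = np.linspace(0.05, 2.0, 40)
best_h = bandwidths[np.argmin([cv_error(h) for h in bandwidths])]
print("CV-selected bandwidth:", round(float(best_h), 3))
```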

On the Use of Adaptive Weights for the F-Norm Support Vector Machine

  • Bang, Sung-Wan;Jhun, Myoung-Shic
    • The Korean Journal of Applied Statistics / v.25 no.5 / pp.829-835 / 2012
  • When the input features are generated by factors in a classification problem, it is more meaningful to identify important factors rather than individual features. The $F_{\infty}$-norm support vector machine (SVM) has been developed to perform automatic factor selection in classification. However, the $F_{\infty}$-norm SVM may suffer from estimation inefficiency and model selection inconsistency because it applies the same amount of shrinkage to each factor without assessing its relative importance. To overcome this limitation, we propose the adaptive $F_{\infty}$-norm ($AF_{\infty}$-norm) SVM, which penalizes the empirical hinge loss by the sum of adaptively weighted factor-wise $L_{\infty}$-norm penalties. The $AF_{\infty}$-norm SVM computes the weights from the 2-norm SVM estimator and can be formulated as a linear programming (LP) problem similar to that of the $F_{\infty}$-norm SVM. Simulation studies show that the proposed $AF_{\infty}$-norm SVM improves upon the $F_{\infty}$-norm SVM in terms of classification accuracy and factor selection performance.
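
The abstract does not state the exact criterion or weight formula, but a plausible form of the $AF_{\infty}$-norm SVM objective it describes, assuming a linear classifier $f(x) = b + x^{\top}w$ with factor-wise coefficient blocks $w_{(1)}, \ldots, w_{(G)}$, is

$$\min_{b,\,w} \; \sum_{i=1}^{n}\bigl[1 - y_i\{b + x_i^{\top}w\}\bigr]_{+} \;+\; \lambda \sum_{g=1}^{G} \tau_g\,\|w_{(g)}\|_{\infty}, \qquad \tau_g = \|\tilde{w}_{(g)}\|_{\infty}^{-1},$$

where $[u]_{+} = \max(u, 0)$ is the hinge loss, $\lambda > 0$ is a tuning parameter, and $\tilde{w}$ is the 2-norm SVM estimator used to form the adaptive weights (the specific choice $\tau_g = \|\tilde{w}_{(g)}\|_{\infty}^{-1}$ is an assumption here). Since both the hinge loss and the $L_{\infty}$ penalties are piecewise linear, introducing auxiliary variables turns the problem into the linear program mentioned in the abstract.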

Selecting Ordering Policy and Items Classification Based on Canonical Correlation and Cluster Analysis

  • Nagasawa, Keisuke;Irohara, Takashi;Matoba, Yosuke;Liu, Shuling
    • Industrial Engineering and Management Systems / v.11 no.2 / pp.134-141 / 2012
  • It is difficult to find an appropriate ordering policy for many types of items, partly because each item has a different demand trend. We classify items by shipment trend and then decide the ordering policy for each item category. In this study, we show that categorizing items by their statistical characteristics leads to an ordering policy suitable for that category. We analyze ordering policies and shipment trends and propose a new method for selecting the ordering policy, based on finding the strongest relation between the classification of the items and the ordering policy. In our numerical experiment, from actual shipment data for about 5,000 items over the past year, we calculated many statistics that represent the trend of each item. Next, we applied canonical correlation analysis between the evaluations of the ordering policies and the various statistics. Furthermore, we applied cluster analysis to the statistics concerning the performance of the ordering policies. Finally, we separate the items into several categories and show that the appropriate ordering policies differ for each category.
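
A schematic version of the pipeline described above, with synthetic placeholder data in place of the 5,000-item shipment records, might look like the following; the statistics, policy scores, and number of clusters are illustrative assumptions.

```python
# Hedged sketch: canonical correlation between item statistics and
# ordering-policy evaluations, followed by clustering of the items.
# All data here are synthetic placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.cross_decomposition import CCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_items = 500
stats = rng.normal(size=(n_items, 8))           # per-item trend statistics (placeholder)
policy_scores = stats[:, :3] @ rng.normal(size=(3, 4)) + rng.normal(size=(n_items, 4))

# Canonical correlation: which combinations of item statistics relate most
# strongly to the ordering-policy evaluations?
cca = CCA(n_components=2)
stats_c, _ = cca.fit_transform(StandardScaler().fit_transform(stats),
                               StandardScaler().fit_transform(policy_scores))

# Cluster items on the canonical variates to form categories, then pick the
# best-scoring policy within each category.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(stats_c)
for k in range(4):
    best_policy = int(policy_scores[labels == k].mean(axis=0).argmax())
    print(f"category {k}: best ordering policy index {best_policy}")
```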

A Study on Classification and Localization of Structural Damage through Wavelet Analysis

  • Koh, Bong-Hwan;Jung, Uk
    • Proceedings of the Korean Society for Noise and Vibration Engineering Conference / 2007.11a / pp.754-759 / 2007
  • This study exploits the data-discriminating capability of silhouette statistics, combined with a wavelet-based vertical energy threshold technique, for the purpose of extracting damage-sensitive features and clustering signals of the same class. The thresholding technique first obtains a suitable subset of the extracted or modified features: a good predictor set should contain features that are strongly correlated with the characteristics of the data regardless of the classification method used, while the features themselves should be as uncorrelated with each other as possible. Silhouette statistics have been used to assess the quality of clustering by measuring how well an object is assigned to its corresponding cluster; we use this concept for the discriminant power function in this paper. The simulation results of damage detection in a truss structure show that the proposed approach can successfully locate both open- and breathing-type damage, even in the presence of a considerable amount of process and measurement noise.
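
As a rough, hedged stand-in for the feature-screening idea described above (not the authors' vertical energy threshold or truss simulation), wavelet detail energies can be scored by how well they separate the classes using silhouette statistics; the signals below are synthetic and the choice of wavelet is arbitrary.

```python
# Hedged sketch: wavelet energy features scored with silhouette statistics.
# Synthetic signals; requires the PyWavelets package (pywt).
import numpy as np
import pywt
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)                                   # two signal classes
signals = rng.normal(size=(100, 256))
signals[50:, 60:90] += np.sin(np.linspace(0, 6 * np.pi, 30))     # class-dependent component

def wavelet_energies(signal, wavelet="db4", level=4):
    """Energy of each detail band of the discrete wavelet decomposition."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    return np.array([np.sum(c ** 2) for c in coeffs[1:]])        # skip the approximation

features = np.vstack([wavelet_energies(s) for s in signals])

# Mean silhouette of the known class labels along each single feature:
# larger values indicate higher discriminant power for that feature.
for j in range(features.shape[1]):
    sil = silhouette_samples(features[:, [j]], labels).mean()
    print(f"detail band {j}: mean silhouette = {sil:.3f}")
```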


Steal Success Model for 2007 Korean Professional Baseball Games (2007년 한국프로야구에서 도루성공모형)

  • Hong, Chong-Sun;Choi, Jeong-Min
    • The Korean Journal of Applied Statistics / v.21 no.3 / pp.455-468 / 2008
  • In baseball game records, the steal plays an important role in determining the result of a game. To study the success or failure of steals, logistic regression models are developed based on the 2007 Korean professional baseball games. The results of the logistic regression models are compared with those of discriminant models, and the logistic regression analysis is found to perform better than the discriminant analysis. We also consider an alternative logistic regression model based on categorical data transformed from continuous data that are difficult to obtain.
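
A minimal sketch of the kind of comparison reported above, with hypothetical covariates standing in for the paper's game records, is shown below; the feature names, coefficients, and data are invented for illustration only.

```python
# Hedged sketch: logistic regression vs. linear discriminant analysis for a
# binary steal-success outcome. Covariates and data are hypothetical.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 800
# Hypothetical covariates: runner speed score, pitcher delivery time, catcher pop time.
X = np.column_stack([rng.normal(50, 10, n), rng.normal(1.4, 0.15, n), rng.normal(2.0, 0.1, n)])
logit_p = 0.08 * (X[:, 0] - 50) + 4.0 * (X[:, 1] - 1.4) - 6.0 * (X[:, 2] - 2.0)
y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))            # 1 = successful steal

logreg = LogisticRegression(max_iter=1000)
lda = LinearDiscriminantAnalysis()
print("logistic regression CV accuracy:", cross_val_score(logreg, X, y, cv=5).mean())
print("discriminant analysis CV accuracy:", cross_val_score(lda, X, y, cv=5).mean())
```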

Algorithm for the Robust Estimation in Logistic Regression (로지스틱회귀모형의 로버스트 추정을 위한 알고리즘)

  • Kim, Bu-Yong;Kahng, Myung-Wook;Choi, Mi-Ae
    • The Korean Journal of Applied Statistics / v.20 no.3 / pp.551-559 / 2007
  • Maximum likelihood estimation is not robust against outliers in logistic regression. Thus we propose an algorithm for robust estimation, which identifies bad leverage points and vertical outliers by a V-mask type criterion and then strives to dampen the effect of the outliers. Our main finding is that, by an appropriate selection of weights and factors, we can obtain logistic estimates with a high breakdown point. The proposed algorithm is evaluated by means of the correct classification rate on real-life and artificial data sets. The results indicate that the proposed algorithm is superior to maximum likelihood estimation in terms of classification.
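
The V-mask criterion itself is specific to the paper and is not reproduced here; as a generic illustration of the underlying idea of downweighting outliers in logistic regression, the sketch below iteratively refits with Huber-type weights based on Pearson residuals. This is an assumption-laden stand-in, not the authors' algorithm.

```python
# Hedged sketch: a generic robustified logistic fit that iteratively downweights
# observations with large Pearson residuals. NOT the authors' V-mask criterion.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(1.0 * X[:, 0] - 1.5 * X[:, 1])))
y = rng.binomial(1, p)
y[:10] = 1 - y[:10]           # contaminate a few responses (vertical outliers)
X[:10] += 5.0                 # and turn them into leverage points

weights = np.ones(n)
model = LogisticRegression(max_iter=1000)
for _ in range(10):
    model.fit(X, y, sample_weight=weights)
    p_hat = model.predict_proba(X)[:, 1]
    resid = (y - p_hat) / np.sqrt(p_hat * (1 - p_hat) + 1e-12)   # Pearson residuals
    c = 2.0                                                       # Huber-type cutoff
    weights = np.where(np.abs(resid) <= c, 1.0, c / np.abs(resid))

print("coefficients after robust reweighting:", model.coef_.ravel())
```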

Time series representation for clustering using unbalanced Haar wavelet transformation (불균형 Haar 웨이블릿 변환을 이용한 군집화를 위한 시계열 표현)

  • Lee, Sehun;Baek, Changryong
    • The Korean Journal of Applied Statistics / v.31 no.6 / pp.707-719 / 2018
  • Various time series representation methods have been proposed for efficient time series clustering and classification. Lin et al. (DMKD, 15, 107-144, 2007) proposed the symbolic aggregate approximation (SAX) method, a symbolic representation obtained after approximating the original time series by piecewise local means. The performance of SAX therefore depends heavily on how well the piecewise local averages approximate the features of the original time series. SAX divides the entire series into an arbitrary number of equal-length segments; however, this is not sufficient to capture key features of complex, large-scale time series data. Therefore, this paper considers a data-adaptive local constant approximation of the time series using the unbalanced Haar wavelet transformation. The proposed method is shown to outperform SAX in many real-world data applications.
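
The equal-length piecewise local-mean step that SAX relies on can be sketched in a few lines, as below; the data-adaptive split points produced by the unbalanced Haar transform proposed in the paper are not reproduced here.

```python
# Hedged sketch: equal-length piecewise local-mean approximation (the PAA step
# underlying SAX). The unbalanced Haar transform would instead choose split
# points adaptively from the data.
import numpy as np

def piecewise_mean(x, n_segments):
    """Approximate series x by the mean of each of n_segments equal pieces."""
    segments = np.array_split(np.asarray(x, dtype=float), n_segments)
    return np.array([seg.mean() for seg in segments])

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 400)
series = np.sin(t) + 0.3 * rng.normal(size=t.size)

approx = piecewise_mean(series, n_segments=16)
print("16-segment piecewise-mean representation:")
print(np.round(approx, 3))
```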

A Study on the Classification of Geospatial Industry based on the Korea Standard Industry Classification (한국표준산업분류에 기초한 공간정보산업의 분류에 관한 연구)

  • Ahn, Jae-Seong;Kim, Hyung-Tae;Heo, Min;Lee, Byoung-Kil
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.29 no.4 / pp.421-428 / 2011
  • It is challenging to survey the size and economic value of the geospatial industry because of the vagueness of the industry's scope. This study proposes a method for the classification of the geospatial industry based on the Korea Standard Industry Classification. The proposed classification method considers the value-added chain of the geospatial industry as well as the Korean Standard Industry Classification, and these considerations reflect the characteristics of the geospatial industry. Industrial statistics for the geospatial industry are expected to be surveyed based on the classification proposed by this study.

A comparison study of classification methods based on SVM and data depth in microarray data (마이크로어레이 자료에서 서포트벡터머신과 데이터 뎁스를 이용한 분류방법의 비교연구)

  • Hwang, Jin-Soo;Kim, Jee-Yun
    • Journal of the Korean Data and Information Science Society / v.20 no.2 / pp.311-319 / 2009
  • Jornsten (2004) introduced clustering and classification methods based on a robust L1 data depth, called DDclus and DDclass. SVM-based classification works well in most situations but shows some weakness in the presence of outliers. Proper gene selection is also important in classification since there are many redundant genes; either selecting appropriate genes or combining gene clustering with a classification method enhances the overall classification performance. The performance of the depth-based methods is evaluated against several SVM-based classification methods.
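
Maximum-depth classification can be sketched with the spatial (L1) depth as below; this is a generic illustration of the depth-based idea, not Jornsten's DDclass procedure, and the data are synthetic.

```python
# Hedged sketch: maximum-depth classification with the spatial (L1) depth.
# Generic illustration only; not Jornsten's DDclass.
import numpy as np

def spatial_depth(x, sample):
    """Spatial (L1) depth of point x with respect to a sample (rows = points)."""
    diffs = x - sample
    norms = np.linalg.norm(diffs, axis=1)
    units = diffs[norms > 0] / norms[norms > 0, None]
    return 1.0 - np.linalg.norm(units.mean(axis=0))

def depth_classify(x, class_samples):
    """Assign x to the class under which it has maximal depth."""
    depths = {label: spatial_depth(x, s) for label, s in class_samples.items()}
    return max(depths, key=depths.get)

rng = np.random.default_rng(0)
class_samples = {0: rng.normal(0.0, 1.0, (60, 5)), 1: rng.normal(1.5, 1.0, (60, 5))}
test_point = rng.normal(1.5, 1.0, 5)
print("predicted class:", depth_classify(test_point, class_samples))
```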
