• Title/Abstract/Keyword: Statistics Classification

Search results: 876 items

Estimating Prediction Errors in Binary Classification Problem: Cross-Validation versus Bootstrap

  • Kim Ji-Hyun;Cha Eun-Song
    • Communications for Statistical Applications and Methods, Vol. 13 No. 1, pp.151-165, 2006
  • It is important to estimate the true misclassification rate of a given classifier when an independent set of test data is not available. Cross-validation and the bootstrap are two possible approaches in this case. In the related literature, bootstrap estimators of the true misclassification rate have been asserted to perform better than cross-validation estimators for small samples. We compare the two estimators empirically when the classification rule is so adaptive to the training data that its apparent misclassification rate is close to zero. We confirm that bootstrap estimators perform better for small samples because of their small variance, but we also find that their bias tends to be significant even for moderate to large samples, in which case cross-validation estimators perform better with less computation.
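
A minimal sketch of the two estimators being compared, not the paper's exact experimental setup: the 1-nearest-neighbor rule, sample size, and number of bootstrap resamples below are illustrative assumptions.

```python
# Cross-validation vs. leave-one-out bootstrap estimates of the misclassification rate.
# Assumed for illustration (not from the paper): 1-NN classifier, n = 50, B = 200.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=50, n_features=5, random_state=0)
clf = KNeighborsClassifier(n_neighbors=1)   # adaptive rule: apparent error near zero

# 10-fold cross-validation estimate of the true misclassification rate
cv_error = 1.0 - cross_val_score(clf, X, y, cv=10).mean()

# Leave-one-out bootstrap: fit on a resample, evaluate on the left-out observations
boot_errors = []
for _ in range(200):
    idx = rng.integers(0, len(y), len(y))        # bootstrap indices
    out = np.setdiff1d(np.arange(len(y)), idx)   # out-of-bootstrap indices
    if out.size == 0:
        continue
    clf.fit(X[idx], y[idx])
    boot_errors.append(np.mean(clf.predict(X[out]) != y[out]))
boot_error = float(np.mean(boot_errors))

print(f"CV estimate: {cv_error:.3f}  bootstrap estimate: {boot_error:.3f}")
```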

A Resetting Scheme for Process Parameters using the Mahalanobis-Taguchi System

  • Park, Chang-Soon
    • 응용통계연구 (Korean Journal of Applied Statistics), Vol. 25 No. 4, pp.589-603, 2012
  • The Mahalanobis-Taguchi system (MTS) is a statistical tool for classifying observations into normal and abnormal groups in multivariate data. In addition to the classification itself, the MTS provides a method for selecting variables useful for the classification, which is especially effective when the abnormal-group data are scattered without a specific directionality. When feedback adjustment of process input variables based on measurements of the process output is not practical, resetting the process parameters can be an alternative. This article proposes a resetting procedure that uses the MTS, along with a method for identifying which input variables to reset based on their contributions. Identifying root-cause parameters with the existing dimension-reduced contribution alone tends to be difficult because of the varied correlation structure of multivariate data, but the decision is improved when it is used together with the location-centered contribution and the individual-parameter contribution.
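
The core MTS classification step is a Mahalanobis distance from the normal group. The sketch below shows only that step under assumed data and an assumed threshold; the paper's resetting procedure and contribution measures are not reproduced.

```python
# Mahalanobis distance of a new observation from the normal group (core MTS step).
# The reference sample, new observation, and threshold are illustrative assumptions;
# the paper's reset and contribution procedures are not implemented here.
import numpy as np

rng = np.random.default_rng(1)
normal = rng.normal(size=(100, 4))                 # reference (normal-group) data
mean = normal.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(normal, rowvar=False))

def scaled_mahalanobis_sq(x):
    """Squared Mahalanobis distance from the normal group, scaled by the number of variables."""
    d = x - mean
    return float(d @ cov_inv @ d) / normal.shape[1]

x_new = np.array([2.5, -1.8, 0.4, 3.0])            # hypothetical process measurement
is_abnormal = scaled_mahalanobis_sq(x_new) > 2.0   # threshold chosen for illustration
print(scaled_mahalanobis_sq(x_new), is_abnormal)
```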

옹벽 구조물의 표준 DB화 방안 및 유지관리 특성 연구 (A Study on Characteristics of Maintenance and Standardization Plan Concerned with DB of Retaining Wall)

  • 이송;심민보
    • 한국구조물진단유지관리공학회 논문집 (Journal of the Korea Institute for Structural Maintenance and Inspection), Vol. 4 No. 4, pp.129-140, 2000
  • A retaining wall is a structure built to secure and make effective use of limited ground when constructing roads, railways, and buildings. Recently, much research has been directed at standardizing and computerizing the maintenance and management of bridges, tunnels, and roads, and a large amount of statistical data has accumulated through such database construction. Database work for retaining walls, however, is almost entirely lacking and lags behind these efforts. This paper proposes a classification system for inspection data; based on that system, the inspection data were coded, a database program for retaining wall inspection data was developed, and maintenance and management data were entered. The purpose of this paper is to suggest the kinds of statistical data to collect and to investigate the inspection characteristics of retaining walls using those statistics.


컨볼루션 뉴럴 네트워크를 이용한 한글 서체 특징 연구 (A study in Hangul font characteristics using convolutional neural networks)

  • 황인경;원중호
    • 응용통계연구 (Korean Journal of Applied Statistics), Vol. 32 No. 4, pp.573-591, 2019
  • Numerical classification systems for Roman-alphabet typefaces are well developed, but numerical criteria for classifying Hangul typefaces are not well defined. The goal of this study is to find the important features that distinguish typeface styles, as a step toward numerical criteria for Hangul font classification. We build a model that discriminates between the Myeongjo and Gothic styles using a convolutional neural network, and analyze the learned filters to find the features that determine the two styles.
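
A minimal sketch of the kind of model described, assuming 64x64 grayscale glyph images and a small two-layer convolutional network; the architecture and data pipeline are illustrative, not the authors' exact model.

```python
# A small CNN that separates two font styles (e.g., Myeongjo vs. Gothic).
# The 64x64 grayscale input and the two-conv-layer architecture are assumptions.
import torch
import torch.nn as nn

class FontCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, 2)   # two style classes

    def forward(self, x):                              # x: (batch, 1, 64, 64)
        return self.classifier(self.features(x).flatten(1))

model = FontCNN()
logits = model(torch.randn(8, 1, 64, 64))              # dummy batch of glyph images
print(logits.shape)                                    # torch.Size([8, 2])

# The learned first-layer filters can then be inspected for style-distinguishing features.
first_layer_filters = model.features[0].weight.detach()   # shape (16, 1, 3, 3)
```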

Double-Bagging Ensemble Using WAVE

  • Kim, Ahhyoun;Kim, Minji;Kim, Hyunjoong
    • Communications for Statistical Applications and Methods, Vol. 21 No. 5, pp.411-422, 2014
  • A classification ensemble method aggregates different classifiers obtained from training data to classify new data points. Voting algorithms are typical tools for summarizing the outputs of the classifiers in an ensemble. WAVE, proposed by Kim et al. (2011), is a weight-adjusted voting algorithm that assigns an optimal weight vector to the classifiers in an ensemble. In this study, we applied the WAVE algorithm to the double-bagging method (Hothorn and Lausen, 2003) when constructing an ensemble, to see whether a significant improvement in performance can be achieved. The results show that double-bagging with the WAVE algorithm performs better than other ensemble methods that employ plurality voting, and is comparable to the random forest ensemble method when the ensemble size is large.
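
A sketch of weight-adjusted voting over a bagged ensemble. The out-of-bag-accuracy weights below are a simple stand-in, not the optimal weight vector of WAVE (Kim et al., 2011), and the additional discriminant-variable step of double-bagging (Hothorn and Lausen, 2003) is omitted.

```python
# Weighted voting across a bagged tree ensemble. The accuracy-based weights are a
# stand-in for WAVE's optimal weight vector; double-bagging's LDA step is omitted.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
trees, weights = [], []
for _ in range(25):                                    # ensemble of 25 trees
    idx = rng.integers(0, len(y_tr), len(y_tr))        # bootstrap sample
    oob = np.setdiff1d(np.arange(len(y_tr)), idx)      # out-of-bag indices
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    trees.append(tree)
    weights.append((tree.predict(X_tr[oob]) == y_tr[oob]).mean())

weights = np.array(weights) / np.sum(weights)          # normalized classifier weights
votes = np.stack([tree.predict(X_te) for tree in trees])   # (n_trees, n_test)
weighted_vote = np.array([np.bincount(votes[:, j], weights=weights, minlength=2).argmax()
                          for j in range(votes.shape[1])])
print("weighted-vote accuracy:", (weighted_vote == y_te).mean())
```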

Ensemble approach for improving prediction in kernel regression and classification

  • Han, Sunwoo;Hwang, Seongyun;Lee, Seokho
    • Communications for Statistical Applications and Methods, Vol. 23 No. 4, pp.355-362, 2016
  • Ensemble methods often improve prediction in various predictive models by combining multiple weak learners and reducing the variability of the final predictive model. In this work, we demonstrate that ensemble methods also enhance prediction accuracy for kernel ridge regression and kernel logistic regression classification. We apply bagging and random forests to these two kernel-based predictive models and present the procedure for embedding bagging and random forests in them. Our proposals are tested on numerous synthetic and real datasets and compared with the plain kernel-based predictive models and a subsampling approach. The numerical studies demonstrate that the ensemble approach outperforms the plain kernel-based predictive models.
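
A minimal sketch of one of the combinations studied, bagging wrapped around kernel ridge regression, using scikit-learn's KernelRidge and BaggingRegressor; the kernel, penalty, and ensemble size are illustrative choices, and the random-forest-style and kernel-logistic variants are not shown.

```python
# Bagging wrapped around kernel ridge regression, compared with the plain model.
# Kernel, alpha, and ensemble size are illustrative; other variants in the paper
# (random-forest-style resampling, kernel logistic regression) are not shown.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

plain = KernelRidge(kernel="rbf", alpha=1.0)
bagged = BaggingRegressor(plain, n_estimators=30, random_state=0)

print("plain  kernel ridge, CV R^2:", cross_val_score(plain, X, y, cv=5).mean())
print("bagged kernel ridge, CV R^2:", cross_val_score(bagged, X, y, cv=5).mean())
```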

Polychotomous Machines

  • Koo, Ja-Yong;Park, Heon Jin;Choi, Daewoo
    • Communications for Statistical Applications and Methods, Vol. 10 No. 1, pp.225-232, 2003
  • The support vector machine (SVM) is becoming increasingly popular in classification. The import vector machine (IVM) has been introduced for its advantages over the SVM. This paper tries to improve the IVM. The proposed method, referred to as the polychotomous machine (PM), uses the Newton-Raphson method to estimate the coefficients, and the Rao and Wald tests, respectively, for the addition and deletion of import points. Because the PM follows essentially the same addition step and adopts a deletion step, it typically uses fewer import vectors than the IVM without losing accuracy. Simulated and real data sets are used to illustrate the performance of the proposed method.
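
A sketch of the coefficient-estimation step mentioned above: Newton-Raphson (equivalently, iteratively reweighted least squares) updates for logistic-regression coefficients. The Rao-test addition and Wald-test deletion of import points, and the kernel expansion itself, are not implemented here.

```python
# Newton-Raphson (IRLS) updates for logistic-regression coefficients.
# The Rao/Wald addition-deletion of import points is not implemented; the data
# and the number of iterations are illustrative assumptions.
import numpy as np

def logistic_newton(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # fitted probabilities
        w = p * (1.0 - p)                        # IRLS weights
        grad = X.T @ (y - p)                     # score vector
        hess = X.T @ (X * w[:, None])            # observed information
        beta += np.linalg.solve(hess, grad)      # Newton step
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_beta = np.array([-0.5, 1.0, 2.0])
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
print(logistic_newton(X, y))                     # should be close to true_beta
```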

Robust inference with order constraint in microarray study

  • Kang, Joonsung
    • Communications for Statistical Applications and Methods, Vol. 25 No. 5, pp.559-568, 2018
  • Gene classification can involve complex order-restricted inference. Examining gene expression patterns across groups under order restrictions makes standard statistical inference ineffective and thus requires different methods. For this problem, Roy's union-intersection principle has some merit. An M-estimator that adjusts for outlier arrays in a microarray study produces a robust test statistic with distribution-insensitive clustering of genes, and in conjunction with the union-intersection principle it provides a nonstandard robust procedure. By exact permutation distribution theory, a conditionally distribution-free test based on the proposed statistic generates the corresponding p-values in a small-sample setup. We apply a false discovery rate (FDR) procedure as a multiple testing correction to the p-values in simulated data and real microarray data. The FDR procedure for the proposed test statistic controls the FDR at all levels of α and π0 (the proportion of true nulls), whereas the FDR procedure for the normal-theory (ANOVA) test statistic fails to control the FDR.
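
A sketch of the final multiple-testing step described above: the Benjamini-Hochberg FDR procedure applied to a vector of per-gene p-values. The p-values here are simulated for illustration; the robust M-estimator / union-intersection test statistic is not reproduced.

```python
# Benjamini-Hochberg FDR control on per-gene p-values. The p-values are simulated;
# the robust union-intersection test statistic of the paper is not reproduced here.
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    k = (np.max(np.nonzero(below)[0]) + 1) if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True                   # reject the k smallest p-values
    return rejected

rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=900),           # true null genes
                        rng.beta(0.5, 20.0, size=100)])  # non-null genes (small p)
print("genes declared significant:", int(benjamini_hochberg(pvals).sum()))
```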

편파화 정도와 동일 편파 위상 차를 이용한 SAR 영상 분류 (Polarimetric SAR Image Classification Based on the Degree of Polarization and Co-Polarized Phase-Difference Statistics)

  • 장지성;오이석
    • 한국전자파학회논문지 (The Journal of Korean Institute of Electromagnetic Engineering and Science), Vol. 18 No. 12, pp.1345-1351, 2007
  • This paper proposes a SAR image classification method based on the degree of polarization (DoP) and the co-polarized phase difference (CPD). First, expressions for obtaining the DoP and CPD from the measured Stokes scattering operator are derived, and the SAR image classification procedure is described. Next, the classification method is applied to fully polarimetric L-band SAR image data obtained from measurements to verify its accuracy, and exceptional cases are examined. Finally, the proposed method is used to classify SAR images into four major groups: bare ground, short vegetation, tall vegetation, and residential areas (villages).
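
A sketch of the two per-pixel quantities the classifier is built on: the degree of polarization computed from a Stokes vector and the co-polarized phase difference estimated from complex HH and VV returns. The data are synthetic and the paper's decision thresholds for the four classes are not reproduced.

```python
# Degree of polarization (DoP) and co-polarized phase difference (CPD).
# Synthetic single-look samples are used; the paper's class thresholds for bare
# ground, short vegetation, tall vegetation, and villages are not reproduced.
import numpy as np

def degree_of_polarization(s0, s1, s2, s3):
    """DoP = sqrt(S1^2 + S2^2 + S3^2) / S0 for a Stokes vector (S0, S1, S2, S3)."""
    return np.sqrt(s1**2 + s2**2 + s3**2) / s0

def copol_phase_difference(hh, vv):
    """Multilooked phase difference (radians) between co-polarized HH and VV returns."""
    return np.angle(np.mean(hh * np.conj(vv)))

rng = np.random.default_rng(0)
hh = rng.normal(size=64) + 1j * rng.normal(size=64)   # toy complex HH samples
vv = rng.normal(size=64) + 1j * rng.normal(size=64)   # toy complex VV samples
print("DoP:", degree_of_polarization(1.0, 0.3, 0.2, 0.1))
print("CPD (rad):", copol_phase_difference(hh, vv))
```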

대용량 자료에서 핵심적인 소수의 변수들의 선별과 로지스틱 회귀 모형의 전개 (Screening Vital Few Variables and Development of Logistic Regression Model on a Large Data Set)

  • 임용빈;조재연;엄경아;이선아
    • 품질경영학회지 (Journal of the Korean Society for Quality Management), Vol. 34 No. 2, pp.129-135, 2006
  • With the advance of computer technology, it is possible to keep all the information needed to monitor equipment in control, along with a huge amount of real-time manufacturing data, in a database. Statistical analysis of large data sets with hundreds of thousands of observations and hundreds of independent variables, some of whose values are missing for many observations, is therefore needed even though it is a formidable computational task. A tree-structured approach to classification is capable of screening important independent variables and their interactions. In a Six Sigma project handling a large amount of manufacturing data, one of the goals is to screen the vital few variables from the trivial many. In this paper we review and summarize the CART, C4.5, and CHAID algorithms and propose a simple method of screening the vital few variables by selecting the variables identified in common by all three algorithms. We also discuss how to develop a logistic regression model on a large data set and illustrate it with a large finance data set collected by a credit bureau for the purpose of predicting company bankruptcy.
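
A sketch of the screen-then-model idea: rank variables with a CART-style tree, keep the vital few, and fit a logistic regression on them. scikit-learn implements only CART, so the C4.5 and CHAID legs of the paper's three-way screen are not shown, and the data are synthetic.

```python
# Screen vital few variables with a CART-style tree, then fit logistic regression.
# Only the CART leg of the paper's CART/C4.5/CHAID screen is shown; data are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=100, n_informative=5,
                           random_state=0)

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
vital_few = np.argsort(tree.feature_importances_)[-5:]      # top-5 ranked variables

logit = LogisticRegression(max_iter=1000)
cv_accuracy = cross_val_score(logit, X[:, vital_few], y, cv=5).mean()
print("selected variables:", vital_few, " CV accuracy:", round(cv_accuracy, 3))
```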