• Title/Summary/Keyword: Statistics Classification

A Note on Linear SVM in Gaussian Classes

  • Jeon, Yongho
    • Communications for Statistical Applications and Methods / v.20 no.3 / pp.225-233 / 2013
  • The linear support vector machine (SVM) is motivated by the maximal margin separating hyperplane and is a popular tool for binary classification tasks. Many studies exist on the consistency properties of the SVM; however, it is unknown whether the linear SVM is consistent for estimating the optimal classification boundary even in the simple case of two Gaussian classes with a common covariance, where the optimal classification boundary is linear. In this paper, we show that the linear SVM can be inconsistent in the univariate Gaussian classification problem with a common variance, even when the best tuning parameter is used.
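
For reference, the optimal boundary in the univariate two-Gaussian setting with a common variance has a closed form, and this is the target a consistent classifier would have to recover. The sketch below uses the standard textbook result, not code from the paper; the function name and the prior parameters are illustrative assumptions.

```python
import math


def bayes_boundary(mu1, mu2, sigma, prior1=0.5, prior2=0.5):
    """Bayes-optimal cut point between N(mu1, sigma^2) and N(mu2, sigma^2).

    Assumes mu1 < mu2; points above the returned value go to class 2.
    The name and signature are hypothetical, chosen for illustration.
    """
    if mu1 >= mu2:
        raise ValueError("expected mu1 < mu2")
    midpoint = (mu1 + mu2) / 2.0
    # Unequal priors shift the boundary toward the mean of the less likely class.
    shift = sigma ** 2 * math.log(prior1 / prior2) / (mu2 - mu1)
    return midpoint + shift
```

With equal priors the boundary is simply the midpoint of the two means, so any estimator whose cut point converges elsewhere is inconsistent for this problem.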

Classification of ratings in online reviews (온라인 리뷰에서 평점의 분류)

  • Choi, Dongjun;Choi, Hosik;Park, Changyi
    • Journal of the Korean Data and Information Science Society / v.27 no.4 / pp.845-854 / 2016
  • Sentiment analysis, or opinion mining, is a text mining technique used to identify subjective information or individual opinions in documents such as blogs, reviews, articles, or social network posts. In the literature, only the problem of binary classification of ratings based on review texts in online reviews has been considered. However, because reviews can be neutral as well as positive or negative, multi-class classification is more appropriate than binary classification. To this end, we consider the multi-class classification of ratings based on review texts. In the preprocessing stage, we extract words related to ratings using the chi-square statistic. The extracted words are then used as input variables to multi-class classifiers, such as support vector machines and the proportional odds model, to compare their predictive performance.
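
The chi-square screening step can be sketched as follows. This is a minimal illustration, assuming each word is scored by the Pearson chi-square statistic of a 2 x K table (word present or absent, by rating class); the paper's exact preprocessing is not given in the abstract, and the function name is hypothetical.

```python
def chi_square_word_score(present_counts, absent_counts):
    """Pearson chi-square statistic for a 2 x K contingency table.

    present_counts[j]: number of reviews in rating class j containing the word.
    absent_counts[j]:  number of reviews in rating class j without the word.
    """
    rows = [present_counts, absent_counts]
    row_totals = [sum(row) for row in rows]
    col_totals = [p + a for p, a in zip(present_counts, absent_counts)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip empty cells to avoid division by zero
                stat += (observed - expected) ** 2 / expected
    return stat
```

Words would then be ranked by this score, and the highest-scoring ones kept as input variables; a word spread across the rating classes in proportion to class size scores zero.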

Classification accuracy measures with minimum error rate for normal mixture (정규혼합분포에서 최소오류의 분류정확도 측도)

  • Hong, C.S.;Lin, Meihua;Hong, S.W.;Kim, G.C.
    • Journal of the Korean Data and Information Science Society / v.22 no.4 / pp.619-630 / 2011
  • In order to estimate an appropriate threshold and evaluate its performance for data drawn from a mixture of two different distributions, nine well-known classification accuracy measures (MVD, Youden's index, the closest-to-(0,1) criterion, the amended closest-to-(0,1) criterion, SSS, symmetry point, accuracy area, TA, and TR) are clustered into five categories on the basis of their characteristics. In credit evaluation studies, it is assumed that the score random variable follows a normal mixture distribution of the default and non-default states. For various normal mixtures, optimal cut-off points are obtained for the classification measures belonging to each category, and the type I and II error rates corresponding to these cut-off points are calculated. We then explore the cases in which these error rates are minimized. If a normal mixture can be estimated for real data of this kind, the results of this study can be used to select the classification accuracy measure with the minimum error rate.
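
Two of the measures named above, Youden's index and the closest-to-(0,1) criterion, can be illustrated with a simple grid search over cut-off points. This is a sketch under stated assumptions, not the paper's procedure: scores at or below the cut-off are classified as default, the mixture components are N(0, 1) and N(2, 1), and all names are hypothetical.

```python
from statistics import NormalDist


def cutoff_search(default, non_default, lo, hi, steps=2000):
    """Grid-search the cut-offs chosen by Youden's index and closest-to-(0,1).

    Assumed convention: a score at or below the cut-off is classified as default.
    """
    best_youden = (float("-inf"), None)
    best_closest = (float("inf"), None)
    for k in range(steps + 1):
        c = lo + (hi - lo) * k / steps
        sensitivity = default.cdf(c)            # defaults correctly flagged
        specificity = 1.0 - non_default.cdf(c)  # non-defaults correctly passed
        youden = sensitivity + specificity - 1.0
        # Squared distance from the ROC point to the ideal corner (0, 1).
        distance = (1.0 - sensitivity) ** 2 + (1.0 - specificity) ** 2
        if youden > best_youden[0]:
            best_youden = (youden, c)
        if distance < best_closest[0]:
            best_closest = (distance, c)
    return best_youden[1], best_closest[1]


default_scores = NormalDist(0, 1)      # assumed score distribution of defaults
non_default_scores = NormalDist(2, 1)  # assumed score distribution of non-defaults
c_youden, c_closest = cutoff_search(default_scores, non_default_scores, -3, 5)

# The two error rates at the Youden cut-off; which one is called type I
# depends on the convention adopted.
missed_default_rate = 1.0 - default_scores.cdf(c_youden)
flagged_non_default_rate = non_default_scores.cdf(c_youden)
```

With equal variances both criteria select a cut-off at the midpoint of the two means; the measures start to disagree once the component variances differ.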

Optimal Criterion of Classification Accuracy Measures for Normal Mixture (정규혼합에서 분류정확도 측도들의 최적기준)

  • Yoo, Hyun-Sang;Hong, Chong-Sun
    • Communications for Statistical Applications and Methods / v.18 no.3 / pp.343-355 / 2011
  • For data assumed to follow a mixture distribution, it is important to find an appropriate threshold and evaluate its performance. Relationships are identified among nine well-known classification accuracy measures: MVD, Youden's index, the closest-to-(0, 1) criterion, the amended closest-to-(0, 1) criterion, SSS, symmetry point, accuracy area, TA, and TR. The conditions on these measures are then categorized into seven groups. Under the normal mixture assumption, we calculate thresholds based on these measures and obtain the corresponding type I and II errors. We then explore which classification measure has the minimum type I and II errors for an estimated mixture distribution, in order to understand the strengths and weaknesses of these classification measures.

A Classification Analysis using Bayesian Neural Network (베이지안 신경망을 이용한 분류분석)

  • Hwang, Jin-Soo;Choi, Seong-Yong;Jun, Hong-Suk
    • Journal of the Korean Data and Information Science Society / v.12 no.2 / pp.11-25 / 2001
  • There are several classification algorithms for modeling the relations, patterns, and rules that exist in data. We learn to classify objects on the basis of instances presented to us, not by being given a set of classification rules. Bayesian learning uses a probability distribution to express our knowledge about unknown parameters and updates that knowledge by the laws of probability as evidence is gathered from data. Neural network models, in turn, are designed to predict an unknown category or quantity on the basis of known attributes through training. In this paper, we compare the misclassification error rates of the Bayesian Neural Network method with those of other classification algorithms (CHAID, CART, and QUEST) using several data sets.

Crop Classification for Inaccessible Areas using Semi-Supervised Learning and Spatial Similarity - A Case Study in the Daehongdan Region, North Korea - (준감독 학습과 공간 유사성을 이용한 비접근 지역의 작물 분류 - 북한 대홍단 지역 사례 연구 -)

  • Kwak, Geun-Ho;Park, No-Wook;Lee, Kyung-Do;Choi, Ki-Young
    • Korean Journal of Remote Sensing / v.33 no.5_2 / pp.689-698 / 2017
  • In this paper, a new classification method that combines semi-supervised learning with the spatial similarity of adjacent pixels is presented for crop classification in inaccessible areas. Iterative classification based on semi-supervised learning is applied to extract reliable training data from the initial classification result obtained with a small number of training data, and the classification results of adjacent pixels are also considered in order to extract new training pixels with less uncertainty. To evaluate the applicability of the proposed method, a case study on the classification of field crops was carried out using multi-temporal Landsat-8 OLI images acquired in the Daehongdan region, North Korea. In the case study, the misclassification of crops and forests and the number of isolated pixels in the initial classification result were greatly reduced by applying the proposed semi-supervised learning method. In addition, combining the classification results of adjacent pixels when extracting new training data greatly reduced both misclassification and isolated pixels, compared with the initial classification and traditional semi-supervised learning results. The proposed method is therefore expected to be effective for classifying areas in which it is difficult to collect sufficient training data.
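
One way to picture the extraction of new training pixels is the sketch below. It is only an illustration: the abstract does not state the actual selection rule, so both the confidence threshold and the requirement that all 4-connected neighbours share the pixel's label are assumptions, as is the function name.

```python
def select_new_training_pixels(labels, confidence, threshold=0.9):
    """Return (row, col, label) for confident pixels whose 4-neighbours all agree.

    ASSUMPTION: the threshold and the neighbour-agreement rule are illustrative
    stand-ins for the paper's spatial similarity check, which is not specified
    in the abstract.
    """
    n_rows, n_cols = len(labels), len(labels[0])
    selected = []
    for r in range(n_rows):
        for c in range(n_cols):
            if confidence[r][c] < threshold:
                continue  # classification too uncertain to reuse as training data
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            in_bounds = [(i, j) for i, j in neighbours
                         if 0 <= i < n_rows and 0 <= j < n_cols]
            if all(labels[i][j] == labels[r][c] for i, j in in_bounds):
                selected.append((r, c, labels[r][c]))
    return selected


# Toy 3 x 3 classification map with a crop/forest boundary on the right.
toy_labels = [["crop", "crop", "forest"],
              ["crop", "crop", "forest"],
              ["crop", "crop", "forest"]]
toy_confidence = [[0.95, 0.95, 0.95],
                  [0.50, 0.95, 0.95],
                  [0.95, 0.95, 0.95]]
new_pixels = select_new_training_pixels(toy_labels, toy_confidence)
```

In an iterative scheme, the selected pixels would be added to the training set and the classifier refitted; pixels next to a class boundary or with low confidence are left out.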

New Splitting Criteria for Classification Trees

  • Lee, Yung-Seop
    • Communications for Statistical Applications and Methods / v.8 no.3 / pp.885-894 / 2001
  • Decision tree methods are one of the data mining techniques. Classification trees are used to predict a class label. When a tree grows, the conventional splitting criteria use the weighted average of the impurities of the left and right child nodes to measure node impurity. In this paper, new splitting criteria for classification trees are proposed that improve the interpretability of trees compared to the conventional methods. The criteria search only for interesting subsets of the data, as opposed to modeling all of the data equally well. As a result, the tree is very unbalanced but extremely interpretable.
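
The conventional criterion mentioned above can be written down in a few lines. The sketch assumes Gini impurity as the node impurity measure (entropy would work the same way) and uses hypothetical function names; the paper's new criteria are not reproduced here.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node, given the class labels of its observations."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_split_impurity(left, right):
    """Conventional split score: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n
```

Because the score averages over both children, a split that isolates one very pure subset is still charged for whatever impurity remains in the other child.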

Evaluating Predictive Ability of Classification Models with Ordered Multiple Categories

  • Oong-Hyun Sung
    • Communications for Statistical Applications and Methods / v.6 no.2 / pp.383-395 / 1999
  • This study is concerned with evaluating the predictive ability of classification models with ordered multiple categories. If the categories can be ordered or ranked, the spread of misclassification should be considered when evaluating the performance of classification models, using the loss rate, since the apparent error rate cannot measure the spread of misclassification. Since the loss rate is known to underestimate the true loss rate, the bootstrap method was used to estimate the true loss rate. Thus, this study suggests a method for evaluating the predictive power of classification models using the loss rate and the bootstrap estimate of the true loss rate.
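
The difference between the apparent error rate and a loss rate can be seen in a small example. The abstract does not define the loss function, so the sketch assumes the loss is the absolute distance between the true and predicted category ranks; the function names and toy data are likewise hypothetical.

```python
def apparent_error_rate(true_ranks, predicted_ranks):
    """Share of observations placed in the wrong category."""
    n = len(true_ranks)
    return sum(t != p for t, p in zip(true_ranks, predicted_ranks)) / n


def loss_rate(true_ranks, predicted_ranks):
    """Average misclassification loss.

    ASSUMPTION: the loss is the absolute rank distance |true - predicted|;
    the paper may define the loss differently.
    """
    n = len(true_ranks)
    return sum(abs(t - p) for t, p in zip(true_ranks, predicted_ranks)) / n


true_ranks = [1, 2, 3, 4]
near_misses = [2, 2, 3, 3]  # two errors, each off by one category
far_misses = [4, 2, 3, 1]   # two errors, each off by three categories
```

Both sets of predictions have the same apparent error rate, but the loss rate separates them because it accounts for how far each misclassification lands from the true category.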

Classification Using Sliced Inverse Regression and Sliced Average Variance Estimation

  • Lee, Hakbae
    • Communications for Statistical Applications and Methods / v.11 no.2 / pp.275-285 / 2004
  • We explore classification analysis using graphical methods based on dimension reduction, such as sliced inverse regression and sliced average variance estimation. Useful information about classification analysis is obtained from sliced inverse regression and sliced average variance estimation through dimension reduction. Two examples are presented, and the classification rates of sliced inverse regression and sliced average variance estimation are compared with those of discriminant analysis and logistic regression.

Network Classification of P2P Traffic with Various Classification Methods (다양한 분류기법을 이용한 네트워크상의 P2P 데이터 분류실험)

  • Han, Seokwan;Hwang, Jinsoo
    • The Korean Journal of Applied Statistics / v.28 no.1 / pp.1-8 / 2015
  • Security has become an issue due to the rapid increase in internet network traffic data. P2P traffic data in particular pose a great challenge to network system administrators. Preemptive measures, such as blocking suspicious traffic data, are necessary for network quality of service (QoS) and efficient resource management. Deep packet inspection (DPI) is the most exact way to detect an intrusion, but it may pose a privacy problem and requires time. We used several machine learning methods and compared their performance in classifying network traffic data in terms of accuracy and time. The Random Forest method shows excellent performance in both accuracy and time.