• Title/Abstract/Keywords: Unbalanced Classification

43 search results, processing time 0.024 s

Classification Analysis for Unbalanced Data

  • 김동아;강수연;송종우
    • 응용통계연구 / Vol. 28, No. 3 / pp. 495-509 / 2015
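  • In typical two-class classification, the proportions of the two classes often do not differ greatly. This paper addresses the classification of unbalanced data, in which the proportions of the two classes differ greatly. Unbalanced data are often harder to classify than balanced data. When an ordinary classification model is applied to such data, most observations tend to be assigned to the larger class, yet in practical applications this kind of misclassification usually incurs the greater loss. We compared the performance of various classification methods combined with sampling techniques. We also compared which method yields the smallest loss when an asymmetric loss is assumed. Performance was compared using the misclassification rate, G-mean, ROC, and AUC (area under the curve).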
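
The measures named in this abstract can be illustrated with a small sketch. It assumes the usual definition of G-mean as the geometric mean of sensitivity and specificity, and it models asymmetric loss as a weighted count of false negatives and false positives. The function names and the cost values (`cost_fn=5.0`, `cost_fp=1.0`) are hypothetical and are not taken from the paper.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn), treating `positive` as the minority class."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def misclassification_rate(y_true, y_pred, positive=1):
    """Fraction of observations assigned to the wrong class."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    return (fn + fp) / (tp + fn + fp + tn)


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def asymmetric_loss(y_true, y_pred, cost_fn=5.0, cost_fp=1.0, positive=1):
    """Total loss when a missed minority case costs more (assumed costs)."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    return cost_fn * fn + cost_fp * fp


# Two minority cases among ten; the classifier predicts the majority class for all.
y_true = [1, 1] + [0] * 8
y_all_majority = [0] * 10
print(misclassification_rate(y_true, y_all_majority))  # low error rate
print(g_mean(y_true, y_all_majority))                  # 0.0: no minority case is found
print(asymmetric_loss(y_true, y_all_majority))         # both misses are penalized
```

The example shows why the misclassification rate alone can mislead on unbalanced data: predicting the larger class everywhere gives a low error rate but a G-mean of zero.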

Confidence Intervals on Variance Components in Two-Way Classification with Interaction Model

  • Kim, Jung I.;Park, Sung H.
    • 품질경영학회지
    • /
    • 제10권1호
    • /
    • pp.7-12
    • /
    • 1982
  • Arvesen (1969) has shown a procedure which produces an approximate confidence interval for a variance component in an unbalanced one-way classification model. In this paper, his work is extended to the case in which the model of interest is an unbalanced two-way classification. Following the procedure described in this paper, approximate confidence intervals are computed by a Monte Carlo simulation.


A Data Mining Procedure for Unbalanced Binary Classification

  • 정한나;이정화;전치혁
    • 대한산업공학회지 / Vol. 36, No. 1 / pp. 13-21 / 2010
  • Predicting customers' contract cancellations is essential for insurance companies, but it is a difficult problem because the customer database is large and the target (cancelled) customers make up a small proportion of it. This paper proposes a new data mining approach to binary classification for large-scale unbalanced data. Over-sampling, clustering, regularized logistic regression and boosting are incorporated in the proposed approach. The proposed approach was applied to a real data set from the insurance domain, and the results were compared with those of some other classification techniques.
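
Of the components listed in this abstract, over-sampling is the simplest to illustrate. The sketch below shows plain random over-sampling with replacement, which duplicates minority-class rows until every class matches the largest one. It is a generic illustration and not the authors' procedure, whose details the abstract does not give; the function name `random_oversample` and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes
    have as many rows as the largest class. Original rows are kept first."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Pool of original rows belonging to this class.
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Two "cancelled" customers (label 1) among ten records.
rows = [[i] for i in range(10)]
labels = [1, 1] + [0] * 8
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))  # both classes now have 8 rows
```

Over-sampling should be applied to the training split only; duplicating rows before splitting would leak copies of training rows into the evaluation data.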

Semi-supervised Software Defect Prediction Model Based on Tri-training

  • Meng, Fanqi;Cheng, Wenying;Wang, Jingdong
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 15, No. 11 / pp. 4028-4042 / 2021
  • To address the difficulty of software defect prediction caused by insufficient labelled defect samples and unbalanced classification, a semi-supervised software defect prediction model was proposed that combines feature normalization, over-sampling, and the Tri-training algorithm. First, feature normalization is used to smooth the feature data and eliminate the influence of excessively large or small feature values on the model's classification performance. Second, over-sampling is used to expand the data, which addresses the class imbalance among the labelled samples. Finally, the Tri-training algorithm learns from the training samples and establishes a defect prediction model. The novelty of this model is that it combines feature normalization, over-sampling, and the Tri-training algorithm to address both the shortage of labelled samples and the class imbalance problem. Simulation experiments using the NASA software defect prediction dataset show that the proposed method outperforms four existing supervised and semi-supervised learning methods in terms of Precision, Recall, and F-Measure.

Detecting Malicious Social Robots with Generative Adversarial Networks

  • Wu, Bin;Liu, Le;Dai, Zhengge;Wang, Xiujuan;Zheng, Kangfeng
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 13, No. 11 / pp. 5594-5615 / 2019
  • Malicious social robots, which are disseminators of malicious information on social networks, seriously affect information security and network environments. The detection of malicious social robots is a hot topic and a significant concern for researchers. A method based on classification has been widely used for social robot detection. However, this method of classification is limited by an unbalanced data set in which legitimate, negative samples outnumber malicious robots (positive samples), which leads to unsatisfactory detection results. This paper proposes the use of generative adversarial networks (GANs) to extend the unbalanced data sets before training classifiers to improve the detection of social robots. Five popular oversampling algorithms were compared in the experiments, and the effects of imbalance degree and the expansion ratio of the original data on oversampling were studied. The experimental results showed that the proposed method achieved better detection performance compared with other algorithms in terms of the F1 measure. The GAN method also performed well when the imbalance degree was smaller than 15%.

New Splitting Criteria for Classification Trees

  • Lee, Yung-Seop
    • Communications for Statistical Applications and Methods / Vol. 8, No. 3 / pp. 885-894 / 2001
  • Decision tree methods are among the data mining techniques. Classification trees are used to predict a class label. When a tree grows, the conventional splitting criteria use the weighted average of the left and the right child nodes for measuring the node impurity. In this paper, new splitting criteria for classification trees are proposed which improve the interpretability of trees compared to the conventional methods. The criteria search only for interesting subsets of the data, as opposed to modeling all of the data equally well. As a result, the tree is very unbalanced but extremely interpretable.
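
The conventional criterion mentioned in this abstract, the weighted average of the impurities of the left and right child nodes, can be sketched as follows. The Gini index is assumed as the impurity measure, which is one common choice; the abstract does not say which measure the paper uses, and the proposed new criteria are not described there, so they are not shown. The function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_child_impurity(left, right):
    """Conventional split score: child impurities weighted by child size."""
    n = len(left) + len(right)
    if n == 0:
        raise ValueError("a split needs at least one observation")
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# A perfect split scores 0; a split with one impure child scores higher.
print(weighted_child_impurity([0, 0], [1, 1]))
print(weighted_child_impurity([0, 0, 0, 1], [1, 1, 1, 1]))
```

Because both children contribute in proportion to their size, this score rewards splits that describe all of the data reasonably well, which is the behaviour the abstract contrasts with criteria that look only for interesting subsets.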


Time series representation for clustering using unbalanced Haar wavelet transformation

  • 이세훈;백창룡
    • 응용통계연구 / Vol. 31, No. 6 / pp. 707-719 / 2018
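  • Various time series representation methods have been proposed to perform classification and clustering of time series data efficiently. This study investigates an improvement of the symbolic aggregate approximation (SAX) method proposed by Lin et al. (2007), which reduces the dimension of a time series by local mean approximation and then discretizes it into symbolic data. For the local mean approximation, SAX divides the series into an arbitrary number of equally spaced segments and averages within each, so its performance depends heavily on the number of segments. This paper therefore proposes a method that uses the unbalanced Haar wavelet transformation to choose the local mean levels in a data-dependent way that reflects the characteristics of the data, rather than at equal spacing, thereby reducing the dimension of the time series effectively while also reducing the loss of information. An empirical data analysis confirmed that the proposed method improves on SAX.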
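
The baseline SAX procedure described in this abstract can be sketched as two steps: a local mean approximation over equally spaced segments, followed by discretization into symbols. The sketch assumes the standard SAX conventions of z-normalizing the series and using equiprobable breakpoints of the standard normal distribution. It does not implement the unbalanced Haar wavelet variant proposed in the paper. The function names `paa` and `sax` and the default alphabet are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def paa(series, n_segments):
    """Local mean approximation over equally spaced segments."""
    n = len(series)
    if n_segments <= 0 or n % n_segments != 0:
        raise ValueError("series length must be a multiple of n_segments")
    width = n // n_segments
    return [mean(series[i * width:(i + 1) * width]) for i in range(n_segments)]


def sax(series, n_segments, alphabet="abcd"):
    """Z-normalize, reduce with paa(), then map each mean to a symbol."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("a constant series cannot be z-normalized")
    z = [(x - mu) / sigma for x in series]
    k = len(alphabet)
    # Breakpoints that split the standard normal into k equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]
    word = ""
    for value in paa(z, n_segments):
        index = sum(1 for b in breakpoints if value > b)
        word += alphabet[index]
    return word


# A step series reduced to two segments: low half, then high half.
print(sax([0, 0, 0, 0, 10, 10, 10, 10], n_segments=2))  # "ad"
```

The fixed segment width in `paa` is the equal-spacing constraint that the abstract identifies as the weakness of SAX: a level shift that falls inside a segment is averaged away.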

Detection of Abnormal Heartbeat using Hierarchical Classification in ECG

  • 이도훈;조백환;박관수;송수화;이종실;지영준;김인영;김선일
    • 대한의용생체공학회:의공학회지 / Vol. 29, No. 6 / pp. 466-476 / 2008
  • As more people use the ambulatory electrocardiogram (ECG) for arrhythmia detection, more researchers are reporting automatic classification algorithms. Most previous studies do not consider the unbalanced data distribution. Even in patients, there are many more normal beats than abnormal beats in 24 hours of data. To solve this problem, hierarchical classification using 21 features was adopted for the detection of abnormal arrhythmia beats. The features include R-R intervals and data describing the morphology of the wave. To validate the algorithm, 44 non-pacemaker recordings from PhysioNet were used. A two-stage hierarchical classification model based on domain knowledge was constructed. Using the suggested method, performance in abnormal beat classification improved over the conventional multi-class classification method. In conclusion, hierarchical classification based on domain knowledge is useful for ECG beat classification with an unbalanced data distribution.

The Case of Proportional Cell Frequencies for the Two-Way Cross-Classification with Interaction

  • Kim, Jong-Duk
    • Journal of the Korean Data and Information Science Society / Vol. 9, No. 2 / pp. 119-138 / 1998
  • The case of proportional cell frequencies for the two-way cross-classification with interaction is considered. Several types of hypotheses for general unbalanced data that are commonly used in the literature are shown, and they are written out for this particular case. A reparameterized form of the cell means model is defined to establish the reparameterized model, the orthogonal property of the model is shown using the augmented matrix, and the numerator sums of squares are computed. Different ways of producing the same analysis of variance tables are shown in both orthogonal and nonorthogonal situations.
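
The condition of proportional cell frequencies can be checked with a short sketch. It assumes the standard definition for a two-way layout, namely that every cell count satisfies n_ij = n_i. × n_.j / n.., where n_i. and n_.j are the row and column totals and n.. is the grand total. The function name is hypothetical, and the check says nothing about the hypotheses or the reparameterized model discussed in the paper.

```python
def has_proportional_frequencies(counts):
    """Return True if every cell count equals row total * column total / grand total.

    `counts` is a list of rows of non-negative integer cell frequencies.
    Integer cross-multiplication avoids floating-point comparison.
    """
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    total = sum(row_totals)
    return all(
        counts[i][j] * total == row_totals[i] * col_totals[j]
        for i in range(len(counts))
        for j in range(len(col_totals))
    )


# Unbalanced but proportional: the second row is 1.5 times the first.
print(has_proportional_frequencies([[2, 4], [3, 6]]))  # True
# Unbalanced and not proportional.
print(has_proportional_frequencies([[2, 4], [3, 5]]))  # False
```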
