• Title/Summary/Keyword: Unbalanced Data

Classification Analysis for Unbalanced Data (불균형 자료에 대한 분류분석)

  • Kim, Dongah;Kang, Suyeon;Song, Jongwoo
    • The Korean Journal of Applied Statistics / v.28 no.3 / pp.495-509 / 2015
  • We study classification when the proportions of the two groups differ substantially, known as the unbalanced classification problem. It is usually more difficult to classify classes accurately in unbalanced data than in balanced data. When standard classification methods are applied to unbalanced data, most observations are likely to be assigned to the larger group because doing so minimizes the misclassification loss. However, misclassifying the smaller group as the larger group can cause a greater loss in most real applications. We compare several classification methods for unbalanced data using sampling techniques (up- and down-sampling). We also check the total loss of the different classification methods when an asymmetric loss is applied to simulated and real data. We use the misclassification rate, G-mean, ROC curve, and AUC (area under the curve) for the performance comparison.
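
As a minimal sketch of the sampling and evaluation ideas in this abstract (not the authors' code), the following Python shows random up- and down-sampling and the G-mean, taken here as the geometric mean of sensitivity and specificity. The function names and the toy data are illustrative assumptions, and both classes are assumed to be present in the labels.

```python
import math
import random

def up_sample(majority, minority, rng):
    """Randomly duplicate minority rows until both classes have equal size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def down_sample(majority, minority, rng):
    """Randomly keep only as many majority rows as there are minority rows."""
    return rng.sample(majority, len(minority)), minority

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)

# Hypothetical toy data: 95 majority rows and 5 minority rows.
rng = random.Random(0)
majority = [[float(i)] for i in range(95)]
minority = [[100.0 + i] for i in range(5)]
up_major, up_minor = up_sample(majority, minority, rng)
down_major, down_minor = down_sample(majority, minority, rng)

# A classifier that always predicts the majority class looks accurate
# on this data, yet its G-mean is 0 because its sensitivity is 0.
y_true = [0] * 95 + [1] * 5
always_majority = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
```

The last lines illustrate why the misclassification rate alone can mislead on unbalanced data, which is the motivation for reporting G-mean and AUC alongside it.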

Fault Detection of Unbalanced Cycle Signal Data Using SOM-based Feature Signal Extraction Method (SOM기반 특징 신호 추출 기법을 이용한 불균형 주기 신호의 이상 탐지)

  • Kim, Song-Ee;Kang, Ji-Hoon;Park, Jong-Hyuck;Kim, Sung-Shick;Baek, Jun-Geol
    • Journal of the Korea Society for Simulation / v.21 no.2 / pp.79-90 / 2012
  • In this paper, a feature signal extraction method is proposed to improve the poor fault detection performance caused by unbalanced data, that is, data with a severe disparity between the numbers of instances in each class. Most of the cyclic signals gathered during the process are recognized as normal, while only a few are regarded as faults, so the cyclic signal data are largely unbalanced. A SOM (Self-Organizing Map)-based feature signal extraction method is considered to counter the adverse effects of unbalanced data. The weight neurons mapped to every node of the SOM grid are extracted as the feature signals of both classes and used as a reference data set for fault detection. kNN (k-Nearest Neighbor) and SVM (Support Vector Machine) are used to build fault detection models, which are compared with Hotelling's $T^2$ control chart, the most widely used method for fault detection. Experiments are conducted using simulated process signals that resemble the cyclic signals common in semiconductor manufacturing.
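
For reference, the statistic behind the Hotelling's $T^2$ control chart mentioned in this abstract can be written as follows. The symbols are assumptions not given in the abstract: $\mathbf{x}$ is a new observation vector, and $\bar{\mathbf{x}}$ and $\mathbf{S}$ are the sample mean vector and covariance matrix estimated from in-control (normal) data.

$$T^2 = (\mathbf{x} - \bar{\mathbf{x}})^{\top} \mathbf{S}^{-1} (\mathbf{x} - \bar{\mathbf{x}})$$

An observation is flagged as a fault when $T^2$ exceeds a chosen control limit.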

A Study on the Unbalanced Current Distribution of HTS Power Cable (초전도 전력케이블의 전류 불평형에 관한 연구)

  • Kim, Jae-Ho;Park, Chung-Hwa
    • Journal of the Korean Society of Safety / v.27 no.6 / pp.43-47 / 2012
  • Unbalanced currents flow in High Temperature Superconducting (HTS) power cables as a result of asymmetrical faults, harmonic distortion, and unbalanced loads. This causes additional loss and leakage field in the HTS power cable and degrades electric power quality and stability. In addition, large unbalanced currents can cause negative-sequence and ground relays to operate. This paper presents an analysis of the unbalanced three-phase current distribution in an HTS power cable caused by unbalanced load conditions and grounding methods, using PSCAD/EMTDC. The results of the analysis provide important data for the design of HTS power cables and useful information for their installation in power systems.
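
The negative-sequence current that this abstract refers to comes from the standard symmetrical-components (Fortescue) decomposition of three phase currents. The Python sketch below illustrates that decomposition only; it is not the paper's PSCAD/EMTDC model, and the phasor values and helper names are hypothetical.

```python
import cmath
import math

# Fortescue operator: a rotation by 120 degrees.
A = cmath.exp(2j * math.pi / 3)

def symmetrical_components(ia, ib, ic):
    """Return the (zero, positive, negative) sequence phasors for phase a."""
    i0 = (ia + ib + ic) / 3
    i1 = (ia + A * ib + A**2 * ic) / 3
    i2 = (ia + A**2 * ib + A * ic) / 3
    return i0, i1, i2

def phasor(magnitude, angle_deg):
    """Build a complex phasor from a magnitude and an angle in degrees."""
    return cmath.rect(magnitude, math.radians(angle_deg))

# Balanced set with a-b-c rotation: only the positive sequence is non-zero.
balanced = symmetrical_components(phasor(1, 0), phasor(1, -120), phasor(1, 120))

# Hypothetical unbalanced load: phase a carries 20% more current.
unbalanced = symmetrical_components(phasor(1.2, 0), phasor(1, -120), phasor(1, 120))
unbalance_ratio = abs(unbalanced[2]) / abs(unbalanced[1])  # |I2| / |I1|
```

A non-zero `unbalanced[2]` is the negative-sequence component that negative-sequence relays respond to, and a non-zero `unbalanced[0]` is the zero-sequence component relevant to ground relays.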

A Data Mining Procedure for Unbalanced Binary Classification (불균형 이분 데이터 분류분석을 위한 데이터마이닝 절차)

  • Jung, Han-Na;Lee, Jeong-Hwa;Jun, Chi-Hyuck
    • Journal of Korean Institute of Industrial Engineers / v.36 no.1 / pp.13-21 / 2010
  • Predicting customer contract cancellation is essential for insurance companies, but it is a difficult problem because the customer database is large and the target (cancelled) customers make up a small proportion of it. This paper proposes a new data mining approach to binary classification for large-scale unbalanced data. Over-sampling, clustering, regularized logistic regression, and boosting are incorporated in the proposed approach. The approach was applied to a real insurance data set, and the results were compared with those of other classification techniques.

Discriminant analysis for unbalanced data using HDBSCAN (불균형자료를 위한 판별분석에서 HDBSCAN의 활용)

  • Lee, Bo-Hui;Kim, Tae-Heon;Choi, Yong-Seok
    • The Korean Journal of Applied Statistics / v.34 no.4 / pp.599-609 / 2021
  • Data with a large difference in the number of objects between clusters are called unbalanced data. In discriminant analysis of unbalanced data, it is more important to classify objects in minority categories well than to classify objects in majority categories well. However, objects in minority categories are often misclassified into majority categories. In this study, we propose a method that combines hierarchical DBSCAN (HDBSCAN) and the synthetic minority oversampling technique (SMOTE) to solve this problem. HDBSCAN is first used to remove noise in the minority and majority categories, and SMOTE is then applied to create new data. The area under the ROC curve (AUC) and F1 scores were used to compare performance with existing methods. In most cases, the method combining HDBSCAN and SMOTE showed high performance indices, and it was found to be an excellent method for classifying unbalanced data.
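
The SMOTE step in this abstract creates each new minority point by interpolating between an existing minority point and one of its nearest minority neighbours. The Python sketch below shows only that interpolation under assumed names and toy data; the HDBSCAN noise-removal step is omitted, and this is not the authors' implementation.

```python
import math
import random

def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each on the segment between a minority
    point and one of its k nearest minority neighbours."""
    rng = rng or random.Random()
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical two-dimensional minority class.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [1.2, 1.8], [1.8, 1.6]]
new_points = smote(minority, n_new=10, k=3, rng=random.Random(42))
```

Because every synthetic point is a convex combination of two minority points, it stays inside the region already occupied by the minority class, which is why removing noisy points beforehand matters.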

Dietary Habit and Unbalanced Diet Status of Young Children by Age (유아의 나이에 따른 편식 및 식습관 실태)

  • Jung, You-Mi
    • Journal of the Korean Society of Food Culture / v.34 no.5 / pp.587-594 / 2019
  • This study investigated the general information, unbalanced diet, and dietary habits of 86 children in Daegu. The research was undertaken to analyze the current state of diet and dietary habits of children, and to provide basic data for nutrition education. The results reveal that younger children have a more unbalanced diet. Children dislike side-dishes the most. Furthermore, due to the longer time taken to consume food, parents persuade children to eat quickly. Children were also determined to have a high intake of foods and drinks containing sugar; beverages containing sugar are consumed 1-2 times a week by 5-year-olds, and once daily by 6- and 7-year-olds. The results of this study can be applied to provide basic data for nutritional education, and assist in the development of dietary programs for young children.

Noninformative Priors for Fieller-Creasy Problem using Unbalanced Data

  • Kim, Dal-Ho;Lee, Woo-Dong;Kang, Sang-Gil
    • Proceedings of the Korean Data and Information Science Society Conference (한국데이터정보과학회:학술대회논문집) / 2005.10a / pp.71-84 / 2005
  • The Fieller-Creasy problem involves statistical inference about the ratio of two independent normal means. It is a difficult problem from either a frequentist or a likelihood perspective. As an alternative, a Bayesian analysis with noninformative priors may provide a solution to this problem. In this paper, we extend the results of Yin and Ghosh (2001) to the unbalanced sample case. We find various noninformative priors, such as first- and second-order matching priors, reference priors, and Jeffreys' priors. The posterior propriety under the proposed noninformative priors is given. Using real data, we provide illustrative examples. Through a simulation study, we compute the frequentist coverage probabilities for the probability matching and reference priors and report some simulation results.

Integrated Partial Sufficient Dimension Reduction with Heavily Unbalanced Categorical Predictors

  • Yoo, Jae-Keun
    • The Korean Journal of Applied Statistics / v.23 no.5 / pp.977-985 / 2010
  • In this paper, we propose an approach to partial sufficient dimension reduction with heavily unbalanced categorical predictors. To this end, we consider integrated categorical predictors and investigate conditions under which the integrated categorical predictor is fully informative for partial sufficient dimension reduction. For illustration, the proposed approach is implemented with optimal partial sliced inverse regression in a simulation and a data analysis.

RDP: A storage-tier-aware Robust Data Placement strategy for Hadoop in a Cloud-based Heterogeneous Environment

  • Muhammad Faseeh Qureshi, Nawab;Shin, Dong Ryeol
    • KSII Transactions on Internet and Information Systems (TIIS) / v.10 no.9 / pp.4063-4086 / 2016
  • Cloud computing is a robust technology that helps resolve many parallel and distributed computing issues in the modern Big Data environment. Hadoop is an ecosystem that processes large data sets in a distributed computing environment. HDFS is the Hadoop filesystem, which distributes data blocks to the cluster nodes. Data block placement has become a bottleneck to overall performance in a Hadoop cluster. The current placement policy assumes that all Datanodes have equal computing capacity to process data blocks. This computing capacity includes the availability of the same storage media and the same processing performance on each node. As a result, Hadoop cluster performance is affected by unbalanced workloads, inefficient storage-tier use, network traffic congestion, and HDFS integrity issues. This paper proposes a storage-tier-aware Robust Data Placement (RDP) scheme that systematically resolves unbalanced workloads, reduces network congestion to an optimal state, makes effective use of the storage tier, and minimizes HDFS integrity issues. The experimental results show that the proposed approach reduced the unbalanced workload issue to 72%. Moreover, it resolves the storage-tier compatibility problem to 81% by predicting storage for block jobs, and it improved overall data block placement by 78% through pre-calculated computing capacity allocations and execution of map files over the respective Namenode and Datanodes.
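
To illustrate why a placement policy that assumes equal node capacity leads to unbalanced workloads, the Python sketch below assigns blocks in proportion to a pre-calculated capacity score. It is a simple greedy heuristic written for illustration, not the RDP scheme from the paper and not an HDFS API; the node names and capacity values are hypothetical.

```python
def place_blocks(num_blocks, capacities):
    """Assign each block to the node whose load is lowest relative to its
    capacity, so higher-capacity nodes receive proportionally more blocks."""
    loads = {node: 0 for node in capacities}
    placement = []
    for block_id in range(num_blocks):
        # Pick the node that would be least loaded after taking this block.
        target = min(loads, key=lambda n: (loads[n] + 1) / capacities[n])
        loads[target] += 1
        placement.append((block_id, target))
    return placement, loads

# Hypothetical cluster: relative computing capacity of three Datanodes.
capacities = {"ssd-node": 4.0, "hdd-node-1": 1.0, "hdd-node-2": 1.0}
placement, loads = place_blocks(60, capacities)
```

Under an equal-capacity assumption each node would receive 20 of the 60 blocks, which would leave the slower nodes as the bottleneck; weighting by capacity shifts blocks toward the faster node.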

Empirical Statistical Power for Testing Multilocus Genotypic Effects under Unbalanced Designs Using a Gibbs Sampler

  • Lee, Chae-Young
    • Asian-Australasian Journal of Animal Sciences / v.25 no.11 / pp.1511-1514 / 2012
  • Epistasis, which may explain a large portion of the phenotypic variation in complex economic traits of animals, has been ignored in many genetic association studies. A Bayesian method was introduced to draw inferences about multilocus genotypic effects based on their marginal posterior distributions obtained with a Gibbs sampler. A simulation study was conducted to provide statistical power under various unbalanced designs using this method. Data were simulated under combined designs of number of loci, within-genotype variance, and sample size in unbalanced designs with or without null combined genotype cells. Mean empirical statistical power was estimated for testing the posterior mean estimate of the combined genotype effect. A practical example of obtaining empirical statistical power estimates for a given sample size under unbalanced designs was provided. The empirical statistical power estimates would be useful for determining an optimal design when interactive associations of multiple loci with complex phenotypes are examined.
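
Empirical statistical power, as used in this abstract, is the fraction of simulated data sets in which a test detects a true effect. The Python sketch below shows that idea for two groups of unequal size. It deliberately swaps in a simple two-sided z-test with known unit variance in place of the paper's Gibbs-sampler-based Bayesian test, and the effect sizes, sample sizes, and function name are assumptions.

```python
import math
import random

def empirical_power(effect, n_small, n_large, reps=2000, seed=1):
    """Fraction of simulated data sets in which a two-sided z-test
    (known unit variance, 5% level) rejects 'no group difference'."""
    rng = random.Random(seed)
    se = math.sqrt(1 / n_small + 1 / n_large)
    rejections = 0
    for _ in range(reps):
        small = [rng.gauss(effect, 1.0) for _ in range(n_small)]
        large = [rng.gauss(0.0, 1.0) for _ in range(n_large)]
        diff = sum(small) / n_small - sum(large) / n_large
        if abs(diff / se) > 1.96:
            rejections += 1
    return rejections / reps

# Same total sample size (50), different balance between the groups.
power_null = empirical_power(effect=0.0, n_small=10, n_large=40)
power_unbalanced = empirical_power(effect=1.0, n_small=10, n_large=40)
power_balanced = empirical_power(effect=1.0, n_small=25, n_large=25)
```

Comparing `power_unbalanced` with `power_balanced` shows how an unbalanced design can lose power at a fixed total sample size, which is the kind of trade-off such simulation studies are meant to quantify.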