• Title/Abstract/Keywords: Imbalanced data

Search results: 154

CAB: Classifying Arrhythmias based on Imbalanced Sensor Data

  • Wang, Yilin;Sun, Le;Subramani, Sudha
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 15, No. 7 / pp. 2304-2320 / 2021
  • Intelligently detecting anomalies in health sensor data streams (e.g., electrocardiogram, ECG) can advance the E-health industry. Patients' physiological signals are collected through sensors, and timely diagnosis and treatment save medical resources, promote physical health, and reduce complications. However, ECG data are difficult to classify automatically: ECG features are hard to extract, and the volume of labeled ECG data is limited, which hurts classification performance. In this paper, we propose a Generative Adversarial Network (GAN)-based deep learning framework (called CAB) for heart arrhythmia classification. CAB focuses on improving detection accuracy from a small number of labeled samples and is trained on class-imbalanced ECG data. Augmenting the ECG data with a GAN model mitigates the impact of data scarcity. After data augmentation, CAB classifies the ECG data using a Bidirectional Long Short-Term Memory recurrent neural network (Bi-LSTM). Experimental results show that CAB outperforms state-of-the-art methods, with an overall classification accuracy of 99.71%. The F1-scores for classifying Normal (N), Supraventricular ectopic (S), Ventricular ectopic (V), Fusion (F), and Unclassifiable (Q) heartbeats are 99.86%, 97.66%, 99.05%, 98.57%, and 99.88%, respectively.
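
The classification stage of CAB is a bidirectional LSTM. As a rough illustration of that stage only (not the authors' implementation, and without the GAN augmentation step), a minimal PyTorch sketch of a Bi-LSTM over fixed-length ECG beat segments might look as follows; the segment length, layer sizes, and five-class output are assumptions.

```python
# Minimal Bi-LSTM beat classifier sketch (not the authors' CAB implementation).
# Assumes fixed-length, single-channel ECG beat segments and the five classes
# N, S, V, F, Q; all hyperparameters are illustrative.
import torch
import torch.nn as nn

class BiLSTMBeatClassifier(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, num_classes=5):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, num_classes)  # 2x for both directions

    def forward(self, x):           # x: (batch, seq_len, 1)
        out, _ = self.lstm(x)       # (batch, seq_len, 2 * hidden_size)
        return self.fc(out[:, -1])  # class logits from the last time step

model = BiLSTMBeatClassifier()
beats = torch.randn(32, 187, 1)     # 32 beats of 187 samples each (illustrative)
logits = model(beats)               # (32, 5) class scores
```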

Tropospheric Anomaly Detection in Multi-Reference Stations Environment during Localized Atmospheric Conditions-(2) : Analytic Results of Anomaly Detection Algorithm

  • Yoo, Yun-Ja
    • 한국항해항만학회지 / Vol. 40, No. 5 / pp. 271-278 / 2016
  • Localized atmospheric conditions between multiple reference stations can produce tropospheric delay irregularities that become error terms affecting positioning accuracy in a network RTK environment. An imbalanced network error can corrupt the entire network solution and degrade the correction accuracy. If an anomaly can be detected before the correction message is generated, the anomalous satellite that degrades the network solution during a tropospheric delay anomaly can be excluded. An atmospheric grid consisting of four meteorological stations was used to detect inhomogeneous weather conditions and tropospheric anomalies from AWS (automatic weather station) meteorological data. The threshold of the anomaly detection algorithm was determined from five years of statistical AWS weather data within the atmospheric grid. The analytic results show that the proposed algorithm can detect an anomalous satellite, raising an anomaly flag when a tropospheric delay anomaly occurs under localized atmospheric conditions between stations, and that differing precipitation conditions between stations are the main factor behind tropospheric anomalies.
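
The abstract does not give the exact threshold rule, only that it is derived from five years of AWS statistics. A generic k-sigma sketch of the idea (flag a value that strays too far from its historical mean) is shown below; the rule, variables, and numbers are illustrative assumptions, not the paper's algorithm.

```python
# Generic k-sigma anomaly-flag sketch; the paper derives its thresholds from
# 5 years of AWS statistics, and the rule below is only an illustrative stand-in.
import numpy as np

def anomaly_flags(history, current, k=3.0):
    """Flag values that deviate more than k standard deviations
    from the historical mean of each monitored quantity."""
    mean = np.mean(history, axis=0)
    std = np.std(history, axis=0)
    return np.abs(current - mean) > k * std

history = np.random.normal(2.4, 0.1, size=(5 * 365, 4))  # 5 years x 4 stations (synthetic)
current = np.array([2.41, 2.39, 2.95, 2.42])              # one epoch of observations
print(anomaly_flags(history, current))                    # e.g. [False False  True False]
```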

F_MixBERT: Sentiment Analysis Model using Focal Loss for Imbalanced E-commerce Reviews

  • Fengqian Pang;Xi Chen;Letong Li;Xin Xu;Zhiqiang Xing
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 18, No. 2 / pp. 263-283 / 2024
  • Users' comments after online shopping are critical to product reputation and business improvement. These comments, often called e-commerce reviews, influence other customers' purchasing decisions. To handle the large volume of e-commerce reviews, automatic analysis based on machine learning and deep learning is drawing increasing attention, with sentiment analysis as a core task. However, e-commerce reviews exhibit the following characteristics: (1) inconsistency between the comment content and the star rating; (2) a large amount of unlabeled data, i.e., comments without a star rating; and (3) data imbalance caused by sparse negative comments. This paper employs Bidirectional Encoder Representations from Transformers (BERT), one of the best natural language processing models, as the base model. Given these data characteristics, we propose the F_MixBERT framework to make better use of inconsistent, low-quality, and unlabeled data and to resolve the data imbalance. In the framework, the proposed MixBERT incorporates the MixMatch approach into BERT's high-dimensional vectors to train on unlabeled and low-quality data with generated pseudo labels. Meanwhile, data imbalance is addressed with focal loss, which down-weights the contribution of the large (majority-class) and easily identified samples to the total loss. Comparative experiments demonstrate that the proposed framework outperforms BERT and MixBERT for sentiment analysis of e-commerce comments.
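
Focal loss, which the framework relies on to handle imbalance, has a standard formulation; a minimal binary PyTorch sketch of the down-weighting idea follows, with gamma and alpha set to common illustrative defaults rather than the paper's settings.

```python
# Minimal binary focal loss sketch; gamma and alpha are illustrative defaults,
# not the settings used in F_MixBERT.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: (1 - p_t)^gamma down-weights easy examples,
    alpha re-balances the positive/negative classes."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                      # model's probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

logits = torch.tensor([2.0, -1.5, 0.3])
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))
```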

스마트 팩토리 반도체 공정 데이터 최적화를 위한 향상된 머신러닝 전처리 방법 연구 (Enhanced Machine Learning Preprocessing Techniques for Optimization of Semiconductor Process Data in Smart Factories)

  • 최승규;이승재;남춘성
    • 한국인터넷방송통신학회논문지 / Vol. 24, No. 4 / pp. 57-64 / 2024
  • The introduction of smart factories has shifted manufacturing toward objective and efficient line management. However, most companies fail to make effective use of the vast amounts of sensor data collected every second. This study aims to use such data to predict product quality and manage production processes efficiently. Because security restrictions prevent access to actual sensor data, the study uses semiconductor-process training data obtained from the "SAMSUNG SDS Brightics AI" site. In machine learning, data preprocessing is a key determinant of model performance; therefore, optimal sensor data were obtained through a preprocessing pipeline of missing-value removal, outlier removal, scaling, and feature elimination. In addition, because the training dataset was imbalanced, oversampling was applied to equalize the class ratio before model evaluation. Among the various models evaluated, an SVM with an RBF kernel achieved high performance (Accuracy: 97.07%, GM: 96.61%) and, when trained on the same data, outperformed the MLP model implemented in "SAMSUNG SDS Brightics AI". Beyond classifying good and defective products from sensor data, the approach can be applied to other problems such as predicting component life cycles and process conditions.
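
A rough sketch of the flow described above (impute, scale, oversample the minority class, then fit an RBF-kernel SVM) using scikit-learn and imbalanced-learn is shown below; the synthetic data, pipeline steps, and parameters are assumptions, and the outlier-removal and feature-elimination steps are omitted.

```python
# Sketch of the described preprocessing + oversampling + SVM(rbf) flow using
# scikit-learn and imbalanced-learn; steps and parameters are illustrative only.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the (non-public) semiconductor sensor data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (rng.random(1000) < 0.07).astype(int)          # ~7% defective: imbalanced labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # missing-value handling
    ("scale", StandardScaler()),                   # scaling
    ("oversample", SMOTE(random_state=0)),         # balance the training classes
    ("svm", SVC(kernel="rbf", C=1.0, gamma="scale")),
])
pipe.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, pipe.predict(X_te)))
```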

심층신경망을 활용한 Cochlodinium polykrikoides 적조 발생 예측 연구 (Study on Cochlodinium polykrikoides Red tide Prediction using Deep Neural Network under Imbalanced Data)

  • 박수호;정민지;황도현;엥흐자리갈 운자야;김나경;윤홍주
    • 한국전자통신학회논문지 / Vol. 14, No. 6 / pp. 1161-1170 / 2019
  • This study proposes a deep neural network model for predicting outbreaks of Cochlodinium polykrikoides red tide. A deep neural network with eight hidden layers was built for the prediction task. Using satellite reanalysis data and numerical weather model data, a total of 59 ocean and meteorological variables were extracted from past red tide outbreak areas and used to train the model. Because red tide outbreak cases were far fewer than non-outbreak cases, the dataset was imbalanced; to address this, an oversampling-based data augmentation technique was applied. Evaluated on historical data, the model achieved an accuracy of about 97%.
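
One plausible arrangement of the described setup (oversampling the rare outbreak class, then training a network with eight hidden layers on 59 input variables) is sketched below with scikit-learn and imbalanced-learn; the layer widths, sampler choice, and synthetic data are assumptions rather than the study's configuration.

```python
# Illustrative sketch only: oversample the rare "outbreak" class, then train a
# network with eight hidden layers on 59 features, mirroring the described setup.
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 59))                 # 59 ocean/meteorological variables (synthetic)
y = (rng.random(5000) < 0.02).astype(int)       # ~2% outbreak cases: heavily imbalanced

X_bal, y_bal = RandomOverSampler(random_state=1).fit_resample(X, y)

model = MLPClassifier(hidden_layer_sizes=(128,) * 8,  # eight hidden layers
                      max_iter=200, random_state=1)
model.fit(X_bal, y_bal)
print("training accuracy:", model.score(X_bal, y_bal))
```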

기업부실 예측 데이터의 불균형 문제 해결을 위한 앙상블 학습 (Ensemble Learning for Solving Data Imbalance in Bankruptcy Prediction)

  • 김명종
    • 지능정보연구 / Vol. 15, No. 3 / pp. 1-15 / 2009
  • The data imbalance problem arises in classification and prediction tasks when the number of samples in one class is markedly smaller than in the other classes. As the imbalance grows, the decision boundary between classes becomes distorted and the classifier's learning performance deteriorates. This study proposes the Geometric Mean-based Boosting (GM-Boost) algorithm to address the data imbalance problem. Because GM-Boost is based on the geometric mean, it can learn from the majority and minority classes simultaneously while focusing additional training on misclassified samples. Validated on a bankruptcy prediction problem, GM-Boost showed higher classification accuracy than the existing Under-Sampling, Over-Sampling, and AdaBoost algorithms and maintained robust performance regardless of the degree of data imbalance.
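
The abstract does not detail GM-Boost's internals, so no attempt is made to reproduce the algorithm here; the sketch below only shows the geometric mean of per-class recalls that the method is named after, computed manually and with imbalanced-learn for comparison.

```python
# Geometric mean of per-class recalls: the class-balanced measure that GM-Boost
# is named after.  This only illustrates the metric, not the boosting algorithm.
import numpy as np
from imblearn.metrics import geometric_mean_score

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

sensitivity = np.mean(y_pred[y_true == 1] == 1)   # recall on the minority class
specificity = np.mean(y_pred[y_true == 0] == 0)   # recall on the majority class
print(np.sqrt(sensitivity * specificity))          # manual geometric mean
print(geometric_mean_score(y_true, y_pred))        # same value via imbalanced-learn
```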

공공기술 사업화를 위한 CTGAN 기반 데이터 불균형 해소 (Resolving CTGAN-based data imbalance for commercialization of public technology)

  • 황철현
    • 한국정보통신학회논문지 / Vol. 26, No. 1 / pp. 64-69 / 2022
  • Commercialization of public technology, which transfers government-led science and technology innovation and R&D outcomes to the private sector, is recognized as a key driver of economic growth. To promote technology transfer, various machine learning methods have been studied to identify success factors or to match public technologies with high commercialization potential to the companies that need them. However, public technology commercialization data are tabular and highly imbalanced between success and failure cases, which limits machine learning performance. This paper proposes using CTGAN to resolve the imbalance in such tabular public technology data. To validate the proposed method, comparative experiments against SMOTE, a statistical approach, were conducted on real public technology commercialization data. In most of the experiments, CTGAN predicted successful commercialization cases more reliably.
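
A minimal sketch of how CTGAN-based augmentation of an imbalanced table might look with the open-source ctgan package is given below; the columns, training settings, and conditional-sampling step are illustrative assumptions, since the paper's data and configuration are not described in the abstract.

```python
# Sketch of CTGAN-based augmentation for imbalanced tabular data using the
# open-source `ctgan` package; columns, epochs, and sample counts are illustrative.
import numpy as np
import pandas as pd
from ctgan import CTGAN

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "tech_field": rng.choice(["bio", "ict", "energy", "materials"], size=n),
    "rnd_budget": rng.lognormal(mean=3.0, sigma=0.8, size=n),   # continuous feature
    "success": (rng.random(n) < 0.1).astype(int),               # ~10% success: imbalanced
})

ctgan = CTGAN(epochs=100)
ctgan.fit(df, discrete_columns=["tech_field", "success"])

# Ask the generator for rows conditioned on the rare "success" class and keep
# only those that actually come back with success == 1.
synthetic = ctgan.sample(600, condition_column="success", condition_value=1)
synthetic = synthetic[synthetic["success"] == 1]

df_balanced = pd.concat([df, synthetic], ignore_index=True)
print(df_balanced["success"].value_counts())
```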

규칙 기반 분류 기법을 활용한 도로교량 안전등급 추정 모델 개발 (Developing an Estimation Model for Safety Rating of Road Bridges Using Rule-based Classification Method)

  • 정세환;임소람;지석호
    • 한국BIM학회 논문집 / Vol. 6, No. 2 / pp. 29-38 / 2016
  • Road bridges are deteriorating gradually, and the number of road bridges over 30 years old is forecast to grow to more than three times the current number. To keep road bridges in a safe condition, their current safety conditions must be estimated for repair or reinforcement, but the budget and professional manpower available for in-depth inspections are limited. This study proposes an estimation model for the safety rating of road bridges by analyzing data from the Facility Management System (FMS) and the Yearbook of Road Bridges and Tunnel, which include basic specifications, year of completion, traffic, safety rating, and other attributes. The distribution of safety ratings was imbalanced, with 91% of road bridges rated A or B. To improve classification performance, the five safety ratings were merged into two classes, G (good: A and B) and P (poor: C or below); this grouping was chosen because facilities rated C or below must be repaired or reinforced to recover their original functionality. 70% of the data were used for training and the remaining 30% for validation. Class P in the training data was oversampled threefold, and the Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithm was used to develop the estimation model. The model achieved an overall accuracy of 84.8% and a true positive rate of 67.3% with 29 classification rules, and year of completion was identified as the most critical factor associated with lower safety ratings.
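
One way to reproduce the general recipe (oversample class P, then induce RIPPER rules) in Python is the wittgenstein package, sketched below; the bridge features and values are synthetic placeholders, and full rebalancing is used here instead of the study's threefold oversampling.

```python
# Sketch of RIPPER rule induction on an oversampled bridge dataset using the
# `wittgenstein` package; the features and values are synthetic placeholders.
import pandas as pd
import wittgenstein as lw
from imblearn.over_sampling import RandomOverSampler

df = pd.DataFrame({
    "year_completed": [1975, 1982, 1999, 2005, 2010, 1968, 1990, 2001] * 40,
    "traffic":        [12000, 8000, 15000, 3000, 22000, 9000, 7000, 11000] * 40,
    "length_m":       [120, 80, 45, 300, 60, 150, 90, 200] * 40,
    "rating":         ["P", "G", "G", "G", "G", "P", "G", "G"] * 40,  # mostly good: imbalanced
})

X, y = df.drop(columns="rating"), df["rating"]
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X, y)   # oversample class P

ripper = lw.RIPPER()
ripper.fit(X_bal, y_bal, pos_class="P")   # learn IF-THEN rules for the "poor" class
print(ripper.ruleset_)                    # human-readable classification rules
print(ripper.score(X, y))                 # accuracy on the original data
```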

고차원 범주형 자료를 위한 비지도 연관성 기반 범주형 변수 선택 방법 (Association-based Unsupervised Feature Selection for High-dimensional Categorical Data)

  • 이창기;정욱
    • 품질경영학회지 / Vol. 47, No. 3 / pp. 537-552 / 2019
  • Purpose: The development of information technology makes it easy to utilize high-dimensional categorical data. In this regard, the purpose of this study is to propose a novel method for selecting the proper categorical variables in high-dimensional categorical data. Methods: The proposed feature selection method consists of three steps: (1) The first step defines the goodness-to-pick measure. In this paper, a categorical variable is considered relevant if it is related to the other variables, so the goodness-to-pick measure is computed as the normalized conditional entropy with respect to the other variables. (2) The second step finds the relevant feature subset of the original variable set by deciding whether each variable is relevant or not. (3) The third step eliminates redundant variables from the relevant feature subset. Results: The experiments showed that the proposed feature selection method generally yielded better classification performance than using no feature selection on high-dimensional categorical data, especially as the number of irrelevant categorical variables increases. Moreover, as the number of irrelevant categorical variables with imbalanced category values grows, the accuracy gap between the proposed method and the compared existing methods widens. Conclusion: The experimental results confirm that the proposed method consistently produces high classification accuracy on high-dimensional categorical data and is therefore promising for high-dimensional settings.
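
The core quantity in step (1), a normalized conditional entropy between categorical variables, can be computed directly from a contingency table. The sketch below uses one plausible normalization, H(X|Y)/H(X); the paper's exact definition may differ.

```python
# Normalized conditional entropy H(X|Y) / H(X) for two categorical variables.
# A value near 0 means Y explains X well; near 1 means Y tells us little about X.
# This is one plausible normalization; the paper's exact definition may differ.
import numpy as np
import pandas as pd

def normalized_conditional_entropy(x, y):
    """H(X|Y) / H(X): 0 when Y fully determines X, 1 when Y is uninformative."""
    joint = pd.crosstab(x, y, normalize=True).to_numpy()       # p(x, y)
    p_x = joint.sum(axis=1)                                    # p(x)
    p_y = joint.sum(axis=0)                                    # p(y)
    h_x = -np.sum(p_x * np.log2(np.where(p_x > 0, p_x, 1.0)))
    ratio = np.where(joint > 0, joint / p_y[np.newaxis, :], 1.0)
    h_x_given_y = -np.sum(joint * np.log2(ratio))              # H(X|Y)
    return h_x_given_y / h_x if h_x > 0 else 0.0

x = pd.Series(["a", "a", "b", "b", "a", "b", "a", "b"])
y = pd.Series(["u", "u", "v", "v", "u", "v", "v", "u"])
print(normalized_conditional_entropy(x, y))   # < 1: y carries information about x
```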

데이터 불균형을 고려한 설명 가능한 인공지능 기반 기업부도예측 방법론 연구 (A Methodology for Bankruptcy Prediction in Imbalanced Datasets using eXplainable AI)

  • 허선우;백동현
    • 산업경영시스템학회지 / Vol. 45, No. 2 / pp. 65-76 / 2022
  • Recently, not only traditional statistical techniques but also machine learning algorithms have been used to make more accurate bankruptcy predictions. However, the insolvency rate of companies dealing with financial institutions is very low, which produces a data imbalance problem. Because data imbalance degrades the performance of artificial intelligence models, the imbalance must be handled first. In addition, as artificial intelligence algorithms are increasingly used for precise decision-making, regulatory pressure to secure the transparency of AI models is growing, for example through mandates to provide explanation functions for AI models. This study therefore presents guidelines for an eXplainable AI-based corporate bankruptcy prediction methodology that applies the SMOTE technique and the LIME algorithm to address the data imbalance and model transparency problems. The implications of this study are as follows. First, SMOTE was confirmed to effectively resolve the data imbalance issue, a problem that is easily overlooked in bankruptcy prediction. Second, the LIME algorithm was used to visualize the basis of the machine learning model's bankruptcy predictions and to derive improvement priorities for the financial variables that increase a company's likelihood of bankruptcy. Third, the case application confirmed that SMOTE and LIME can be used together, broadening the scope of application for future research.
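
A compact sketch of the SMOTE-then-explain flow described above, using imbalanced-learn for SMOTE and the lime package for local explanations, is given below; the classifier choice, feature names, and synthetic data are assumptions.

```python
# Sketch of the SMOTE + model + LIME flow described above, using scikit-learn,
# imbalanced-learn, and the `lime` package; the data are a synthetic stand-in.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
feature_names = [f"ratio_{i}" for i in range(10)]        # placeholder financial ratios
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=2.0, size=2000) > 3.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)   # fix the imbalance

clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)

explainer = LimeTabularExplainer(X_bal, feature_names=feature_names,
                                 class_names=["healthy", "bankrupt"],
                                 mode="classification")
exp = explainer.explain_instance(X_te[0], clf.predict_proba, num_features=5)
print(exp.as_list())   # top features pushing this firm toward or away from bankruptcy
```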