• Title/Summary/Keyword: Imbalance data

Data Processing of AutoML-based Classification Models for Improving Performance in Unbalanced Classes (불균형 클래스에서 AutoML 기반 분류 모델의 성능 향상을 위한 데이터 처리)

  • Lee, Dong-Joon;Kang, Ji-Soo;Chung, Kyungyong
    • Journal of Convergence for Information Technology / v.11 no.6 / pp.49-54 / 2021
  • With the recent development of smart healthcare technology, interest in daily diseases is increasing. However, healthcare data is imbalanced between positive and negative cases. This arises because people without a given disease far outnumber patients with it, which makes positive data difficult to collect. Data imbalance needs to be corrected because it affects learning performance during disease prediction and analysis. Therefore, in this paper, we replace missing values through multiple imputation in models that detect whether a disease is present, and resolve data imbalance through over-sampling. Using AutoML on the preprocessed data, we generate several models and select the top 3 to build an ensemble model.
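
The abstract above does not say which over-sampling variant is used. As a minimal sketch, assuming plain random over-sampling (a stand-in, not necessarily the authors' method), minority samples are duplicated until every class matches the largest one. The function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all classes
    have as many samples as the largest class (assumed variant)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw the missing number of samples with replacement.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

# Toy data: four negative cases and one positive case.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have the same count
```

Over-sampling of this kind should be applied to the training split only, so that duplicated samples do not leak into evaluation data.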

Joint Compensation of Transmitter and Receiver IQ Imbalance in OFDM Systems Based on Selective Coefficient Updating

  • Rasi, Jafar;Tazehkand, Behzad Mozaffari;Niya, Javad Musevi
    • ETRI Journal / v.37 no.1 / pp.43-53 / 2015
  • In this paper, a selective coefficient updating (SCU) approach is applied at each branch of the per-tone equalization (PTEQ) structure for the case of insufficient cyclic prefix (CP) length. SCU is proposed because of the large number of adaptive filters in the PTEQ structure and their complex adaptation process. This method reduces the computational complexity while the performance remains almost unchanged. Moreover, set-membership filtering with a variable step size is proposed for the sufficient CP case to increase convergence speed and decrease the average number of calculations. Simulation results show that the aforementioned algorithms perform similarly to conventional algorithms while reducing the number of calculations required. In addition, these algorithms can compensate for both the channel effect and the transmitter/receiver in-phase/quadrature-phase imbalances.

The effects of the RMB's appreciation on trade balance in US

  • Gong, Chi;Liu, Zi-Yang
    • Journal of the Korea Society of Computer and Information / v.20 no.11 / pp.135-142 / 2015
  • This paper applies a VAR model to analyze the effects of the RMB exchange rate on processing trade, non-processing trade, and FDI. The results show that appreciation of the RMB could not solve the problem of the US trade deficit. It is more likely that appreciation would merely shift the trade imbalance to other countries trading with the US, which would not fundamentally solve the economic problems of the US. This paper also finds that service trade is in surplus, while the main goods deficit occurs in advanced technology products, especially in information and communications trade. The US has a real advantage in these industries, so the situation would change if the US lowered its barriers in these industries. In that way, the imbalance should be greatly reduced.

A Study of Home Informatization and it′s Effect on the Family Resource Management - focused on the Internet Use- (가정정보화와 이로 인한 가정자원관리의 변화에 대한 연구 - 인터넷사용을 중심으로 -)

  • 이기영;이현아
    • Journal of Families and Better Life / v.20 no.1 / pp.17-31 / 2002
  • The purpose of this study is to investigate the effects of home informatization on family resource management. For this purpose, we analyze the level of home informatization, focusing on Internet use, and its effects on family resource management through time management and financial management. Data were collected from 582 housewives who use the Internet at home. The results show that home informatization through Internet use has totally changed family resource management. It helps improve the planning and efficiency of resource management, but at the same time it causes imbalance in resource management. Housewives' Internet use also affects time allocation and household expenditure. These changes depend on socio-demographic variables, home informatization related variables, and personal resource variables. The results show that the ability to manage time and finances is much more important for improving the level of planning and efficiency and for decreasing the level of imbalance in the managerial subsystem. The results of this research suggest several implications for public policy.

A Study of a Method for Maintaining Accuracy Uniformity When Using Long-tailed Dataset (불균형 데이터세트 학습에서 정확도 균일화를 위한 학습 방법에 관한 연구)

  • Geun-pyo Park;XinYu Piao;Jong-Kook Kim
    • Proceedings of the Korea Information Processing Society Conference / 2023.05a / pp.585-587 / 2023
  • Long-tailed datasets have an imbalanced distribution because the number of data samples differs for each class. This leads to performance degradation in tail classes and to imbalanced accuracy across classes. To address these problems, this paper suggests a learning method for training on long-tailed datasets. The proposed method combines two methods: one is a resampling method that generates a uniform mini-batch to prevent performance degradation in tail classes, and the other is a reweighting method that addresses the accuracy imbalance problem. The purpose of the proposed method is to train models to have uniform accuracy for each class in a long-tailed dataset.
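
The two ingredients named above, uniform mini-batch resampling and loss reweighting, can be sketched as follows. This is an illustration under assumptions, not the paper's algorithm: the class-uniform sampler and the inverse-frequency weights are common stand-ins, and the names `uniform_minibatch` and `inverse_frequency_weights` are hypothetical.

```python
import random
from collections import Counter

def uniform_minibatch(labels, batch_size, rng):
    """Pick sample indices so that every class is equally likely to be drawn,
    regardless of how many samples it has."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    classes = sorted(by_class)
    # First choose a class uniformly, then a sample within that class.
    return [rng.choice(by_class[rng.choice(classes)]) for _ in range(batch_size)]

def inverse_frequency_weights(labels):
    """Per-class loss weights proportional to 1 / class count,
    scaled so that the weights average to 1 over the classes."""
    counts = Counter(labels)
    raw = {c: 1.0 / n for c, n in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {c: w * scale for c, w in raw.items()}

# Toy long-tailed label set: one head class and two tail classes.
labels = [0] * 90 + [1] * 9 + [2] * 1
rng = random.Random(0)
batch = uniform_minibatch(labels, 12, rng)
weights = inverse_frequency_weights(labels)
print(Counter(labels[i] for i in batch))
print(weights)
```

In practice the two are usually not applied at full strength together, since class-uniform sampling already removes most of the frequency bias that inverse-frequency weights correct for.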

Context-Dependent Video Data Augmentation for Human Instance Segmentation (인물 개체 분할을 위한 맥락-의존적 비디오 데이터 보강)

  • HyunJin Chun;JongHun Lee;InCheol Kim
    • KIPS Transactions on Software and Data Engineering / v.12 no.5 / pp.217-228 / 2023
  • Video instance segmentation is a highly complex intelligent visual task because it requires not only object instance segmentation for each image frame of a video but also accurate tracking of instances throughout the frame sequence. In particular, human instance segmentation in drama videos has a unique characteristic: it requires accurate tracking of several main characters interacting in various places and at various times. It is also characterized by a kind of class imbalance problem, because main characters appear far more frequently than supporting or auxiliary characters in drama videos. In this paper, we introduce a new human instance dataset called MHIS, which is built upon videos of the drama Miseang, and then propose a novel video data augmentation method, CDVA, to overcome the data imbalance problem between character classes. Unlike previous video data augmentation methods, the proposed CDVA generates more realistic augmented videos by deciding the optimal location within the background clip at which to insert a target human instance, taking into account the rich spatio-temporal context embedded in videos. Therefore, the proposed augmentation method, CDVA, can improve the performance of a deep neural network model for video instance segmentation. Through both quantitative and qualitative experiments using the MHIS dataset, we prove the usefulness and effectiveness of the proposed video data augmentation method.

A Comparison of Ensemble Methods Combining Resampling Techniques for Class Imbalanced Data (데이터 전처리와 앙상블 기법을 통한 불균형 데이터의 분류모형 비교 연구)

  • Lee, Hee-Jae;Lee, Sungim
    • The Korean Journal of Applied Statistics / v.27 no.3 / pp.357-371 / 2014
  • There are many studies related to imbalanced data in which the class distribution is highly skewed. To address the problem of imbalanced data, previous studies use resampling techniques that correct the skewness of the class distribution in each sampled subset by under-sampling, over-sampling, or hybrid sampling such as SMOTE. Ensemble methods have also alleviated the problem of class-imbalanced data. In this paper, we compare around a dozen algorithms that combine ensemble methods and resampling techniques, based on simulated data sets generated by the Backbone model, which can control the imbalance rate. Results on various real imbalanced data sets are also presented to compare the effectiveness of the algorithms. As a result, we strongly recommend combining resampling techniques with ensemble methods for imbalanced data in which the proportion of the minority class is less than 10%. We also find that each ensemble method has a well-matched sampling technique. Algorithms that combine bagging or random forest ensembles with random under-sampling tend to perform well, whereas the boosting ensemble appears to perform better with over-sampling. All ensemble methods combined with SMOTE perform best in most situations.
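
For readers unfamiliar with SMOTE, mentioned above, its core step is interpolation between a minority sample and one of its nearest minority neighbours. The following is a minimal sketch of that idea only, assuming Euclidean distance and list-based feature vectors. The function name `smote_sketch` and its parameters are hypothetical, and this is not the implementation evaluated in the paper.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    random minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself.
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(x, nb)])
    return synthetic

# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, 5)
print(new_points)
```

Because each synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already covered by the minority class rather than repeating existing samples exactly.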

Dynamically weighted loss based domain adversarial training for children's speech recognition (어린이 음성인식을 위한 동적 가중 손실 기반 도메인 적대적 훈련)

  • Seunghee, Ma
    • The Journal of the Acoustical Society of Korea / v.41 no.6 / pp.647-654 / 2022
  • Although the number of fields that use children's speech recognition is on the rise, the lack of quality data is an obstacle to improving children's speech recognition performance. This paper proposes a new method for improving children's speech recognition performance by additionally using adult speech data. The proposed method is a transformer-based domain adversarial training method that uses a dynamically weighted loss to effectively address the data imbalance between age groups, which grows as the amount of adult training data increases. Specifically, the degree of class imbalance in the mini-batch during training is quantified, and the loss function is defined so that the smaller the amount of data, the greater the weight. Experiments validate the utility of the proposed domain adversarial training under asymmetry between adult and child training data. Experiments show that the proposed method achieves higher children's speech recognition performance than the traditional domain adversarial training method under all conditions in which asymmetry between age groups occurs in the training data.
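
The abstract does not give the exact weighting formula. As a hedged sketch of the general idea (weights recomputed per mini-batch, larger for the rarer domain), the example below assumes weights of the form n / (k * count), where n is the batch size and k the number of domains present. The function names and the toy losses are hypothetical.

```python
from collections import Counter

def dynamic_domain_weights(batch_domains):
    """Per-domain weights computed from the current mini-batch only:
    the fewer samples a domain has in the batch, the larger its weight."""
    counts = Counter(batch_domains)
    n, k = len(batch_domains), len(counts)
    return {d: n / (k * c) for d, c in counts.items()}

def weighted_mean_loss(losses, batch_domains):
    """Average of per-sample losses after applying the dynamic weights."""
    w = dynamic_domain_weights(batch_domains)
    total = sum(w[d] * loss for loss, d in zip(losses, batch_domains))
    return total / len(losses)

# Toy mini-batch: six adult utterances and two child utterances.
domains = ["adult"] * 6 + ["child"] * 2
losses = [0.5] * 6 + [1.5] * 2
w = dynamic_domain_weights(domains)
print(w, weighted_mean_loss(losses, domains))
```

With this choice the weighted loss equals the average of the per-domain mean losses, so each domain contributes equally to the update no matter how the batch happens to be composed.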

Conditional Generative Adversarial Network based Collaborative Filtering Recommendation System (Conditional Generative Adversarial Network(CGAN) 기반 협업 필터링 추천 시스템)

  • Kang, Soyi;Shin, Kyung-shik
    • Journal of Intelligence and Information Systems / v.27 no.3 / pp.157-173 / 2021
  • With the development of information technology, the amount of available information increases daily. However, having access to so much information makes it difficult for users to easily find the information they seek. Users want a visualized system that reduces information retrieval and learning time, saving them from personally reading and judging all available information. As a result, recommendation systems are increasingly important technologies that are essential to business. Collaborative filtering is used in various fields with excellent performance because recommendations are made based on similar user interests and preferences. However, limitations do exist. Sparsity occurs when user-item preference information is insufficient, and is the main limitation of collaborative filtering. The evaluation value of the user-item matrix may be distorted by the data depending on the popularity of the product, or there may be new users who have not yet evaluated the value. The lack of historical data to identify consumer preferences is referred to as data sparsity, and various methods have been studied to address these problems. However, most attempts to solve the sparsity problem are not optimal because they can only be applied when additional data such as users' personal information, social networks, or characteristics of items are included. Another problem is that real-world score data are mostly biased toward high scores, resulting in severe imbalances. One cause of this imbalanced distribution is purchasing bias, in which only users with high product ratings purchase products, so those with low ratings are less likely to purchase products and thus do not leave negative product reviews. Due to these characteristics, unlike most users' actual preferences, reviews by users who purchase products are more likely to be positive.
Therefore, the actual rating data is over-learned in many classes with high incidence due to its biased characteristics, distorting the market. Applying collaborative filtering to these imbalanced data leads to poor recommendation performance due to excessive learning of biased classes. Traditional oversampling techniques to address this problem are likely to cause overfitting because they repeat the same data, which acts as noise in learning, reducing recommendation performance. In addition, pre-processing methods for most existing data imbalance problems are designed and used for binary classes. Binary class imbalance techniques are difficult to apply to multi-class problems because they cannot model multi-class problems, such as objects at cross-class boundaries or objects overlapping multiple classes. To solve this problem, research has been conducted to convert and apply multi-class problems to binary class problems. However, simplification of multi-class problems can cause potential classification errors when combined with the results of classifiers learned from other sub-problems, resulting in loss of important information about relationships beyond the selected items. Therefore, it is necessary to develop more effective methods to address multi-class imbalance problems. We propose a collaborative filtering model using CGAN to generate realistic virtual data to populate the empty user-item matrix. The conditional vector y identifies distributions for minority classes and generates data reflecting their characteristics. Collaborative filtering then maximizes the performance of the recommendation system via hyperparameter tuning. This process should improve the accuracy of the model by addressing the sparsity problem of collaborative filtering implementations while mitigating data imbalances arising from real data. Our model has superior recommendation performance over existing oversampling techniques and existing real-world data with data sparsity.
SMOTE, Borderline SMOTE, SVM-SMOTE, ADASYN, and GAN were used as comparison models, and our model shows the highest prediction accuracy on the RMSE and MAE evaluation metrics. Through this study, oversampling based on deep learning can further refine the performance of recommendation systems using actual data and can be used to build business recommendation systems.

Improved Network Intrusion Detection Model through Hybrid Feature Selection and Data Balancing (Hybrid Feature Selection과 Data Balancing을 통한 효율적인 네트워크 침입 탐지 모델)

  • Min, Byeongjun;Ryu, Jihun;Shin, Dongkyoo;Shin, Dongil
    • KIPS Transactions on Software and Data Engineering / v.10 no.2 / pp.65-72 / 2021
  • Recently, attacks on the network environment have been rapidly escalating and becoming more intelligent. As a result, the limitations of signature-based network intrusion detection systems are becoming clear. To solve these problems, research on machine learning-based intrusion detection systems is being conducted in many ways, but two problems arise when using machine learning for intrusion detection. The first is finding the important features for learning so that detection can run in real time, and the second is the imbalance of the data used in learning. This problem is fatal because the performance of machine learning algorithms is data-dependent. In this paper, we propose the HFS-DNN, a network intrusion detection model based on a deep neural network, to solve the problems presented above. The proposed HFS-DNN was trained on the NSL-KDD data set and compared with existing classification models. Experiments confirmed that the proposed Hybrid Feature Selection algorithm does not degrade performance, and in an experiment comparing learning models that address the imbalance problem, the model proposed in this paper showed the best performance.