• Title/Summary/Keyword: Imbalanced Data

Search Result 162, Processing Time 0.024 seconds

Video augmentation technique for human action recognition using genetic algorithm

  • Nida, Nudrat;Yousaf, Muhammad Haroon;Irtaza, Aun;Velastin, Sergio A.
    • ETRI Journal
    • /
    • v.44 no.2
    • /
    • pp.327-338
    • /
    • 2022
  • Classification models for human action recognition require robust features and large training sets for good generalization. However, data augmentation methods are employed for imbalanced training sets to achieve higher accuracy. These samples generated using data augmentation only reflect existing samples within the training set, their feature representations are less diverse and hence, contribute to less precise classification. This paper presents new data augmentation and action representation approaches to grow training sets. The proposed approach is based on two fundamental concepts: virtual video generation for augmentation and representation of the action videos through robust features. Virtual videos are generated from the motion history templates of action videos, which are convolved using a convolutional neural network, to generate deep features. Furthermore, by observing an objective function of the genetic algorithm, the spatiotemporal features of different samples are combined, to generate the representations of the virtual videos and then classified through an extreme learning machine classifier on MuHAVi-Uncut, iXMAS, and IAVID-1 datasets.

Compound Loss Function of semantic segmentation models for imbalanced construction data

  • Chern, Wei-Chih;Kim, Hongjo;Asari, Vijayan;Nguyen, Tam
    • International conference on construction engineering and project management
    • /
    • 2022.06a
    • /
    • pp.808-813
    • /
    • 2022
  • This study presents the problems of data imbalance, varying difficulties across target objects, and small objects in construction object segmentation for far-field monitoring and utilize compound loss functions to address it. Construction site scenes of assembling scaffolds were analyzed to test the effectiveness of compound loss functions for five construction object classes---workers, hardhats, harnesses, straps, hooks. The challenging problem was mitigated by employing a focal and Jaccard loss terms in the original loss function of LinkNet segmentation model. The findings indicates the importance of the loss function design for model performance on construction site scenes for far-field monitoring.

  • PDF

Optimization Driven MapReduce Framework for Indexing and Retrieval of Big Data

  • Abdalla, Hemn Barzan;Ahmed, Awder Mohammed;Al Sibahee, Mustafa A.
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.14 no.5
    • /
    • pp.1886-1908
    • /
    • 2020
  • With the technical advances, the amount of big data is increasing day-by-day such that the traditional software tools face a burden in handling them. Additionally, the presence of the imbalance data in big data is a massive concern to the research industry. In order to assure the effective management of big data and to deal with the imbalanced data, this paper proposes a new indexing algorithm for retrieving big data in the MapReduce framework. In mappers, the data clustering is done based on the Sparse Fuzzy-c-means (Sparse FCM) algorithm. The reducer combines the clusters generated by the mapper and again performs data clustering with the Sparse FCM algorithm. The two-level query matching is performed for determining the requested data. The first level query matching is performed for determining the cluster, and the second level query matching is done for accessing the requested data. The ranking of data is performed using the proposed Monarch chaotic whale optimization algorithm (M-CWOA), which is designed by combining Monarch butterfly optimization (MBO) [22] and chaotic whale optimization algorithm (CWOA) [21]. Here, the Parametric Enabled-Similarity Measure (PESM) is adapted for matching the similarities between two datasets. The proposed M-CWOA outperformed other methods with maximal precision of 0.9237, recall of 0.9371, F1-score of 0.9223, respectively.

Training Data Sets Construction from Large Data Set for PCB Character Recognition

  • NDAYISHIMIYE, Fabrice;Gang, Sumyung;Lee, Joon Jae
    • Journal of Multimedia Information System
    • /
    • v.6 no.4
    • /
    • pp.225-234
    • /
    • 2019
  • Deep learning has become increasingly popular in both academic and industrial areas nowadays. Various domains including pattern recognition, Computer vision have witnessed the great power of deep neural networks. However, current studies on deep learning mainly focus on quality data sets with balanced class labels, while training on bad and imbalanced data set have been providing great challenges for classification tasks. We propose in this paper a method of data analysis-based data reduction techniques for selecting good and diversity data samples from a large dataset for a deep learning model. Furthermore, data sampling techniques could be applied to decrease the large size of raw data by retrieving its useful knowledge as representatives. Therefore, instead of dealing with large size of raw data, we can use some data reduction techniques to sample data without losing important information. We group PCB characters in classes and train deep learning on the ResNet56 v2 and SENet model in order to improve the classification performance of optical character recognition (OCR) character classifier.

TANFIS Classifier Integrated Efficacious Aassistance System for Heart Disease Prediction using CNN-MDRP

  • Bhaskaru, O.;Sreedevi, M.
    • International Journal of Computer Science & Network Security
    • /
    • v.22 no.10
    • /
    • pp.171-176
    • /
    • 2022
  • A dramatic rise in the number of people dying from heart disease has prompted efforts to find a way to identify it sooner using efficient approaches. A variety of variables contribute to the condition and even hereditary factors. The current estimate approaches use an automated diagnostic system that fails to attain a high level of accuracy because it includes irrelevant dataset information. This paper presents an effective neural network with convolutional layers for classifying clinical data that is highly class-imbalanced. Traditional approaches rely on massive amounts of data rather than precise predictions. Data must be picked carefully in order to achieve an earlier prediction process. It's a setback for analysis if the data obtained is just partially complete. However, feature extraction is a major challenge in classification and prediction since increased data increases the training time of traditional machine learning classifiers. The work integrates the CNN-MDRP classifier (convolutional neural network (CNN)-based efficient multimodal disease risk prediction with TANFIS (tuned adaptive neuro-fuzzy inference system) for earlier accurate prediction. Perform data cleaning by transforming partial data to informative data from the dataset in this project. The recommended TANFIS tuning parameters are then improved using a Laplace Gaussian mutation-based grasshopper and moth flame optimization approach (LGM2G). The proposed approach yields a prediction accuracy of 98.40 percent when compared to current algorithms.

Image-to-Image Translation with GAN for Synthetic Data Augmentation in Plant Disease Datasets

  • Nazki, Haseeb;Lee, Jaehwan;Yoon, Sook;Park, Dong Sun
    • Smart Media Journal
    • /
    • v.8 no.2
    • /
    • pp.46-57
    • /
    • 2019
  • In recent research, deep learning-based methods have achieved state-of-the-art performance in various computer vision tasks. However, these methods are commonly supervised, and require huge amounts of annotated data to train. Acquisition of data demands an additional costly effort, particularly for the tasks where it becomes challenging to obtain large amounts of data considering the time constraints and the requirement of professional human diligence. In this paper, we present a data level synthetic sampling solution to learn from small and imbalanced data sets using Generative Adversarial Networks (GANs). The reason for using GANs are the challenges posed in various fields to manage with the small datasets and fluctuating amounts of samples per class. As a result, we present an approach that can improve learning with respect to data distributions, reducing the partiality introduced by class imbalance and hence shifting the classification decision boundary towards more accurate results. Our novel method is demonstrated on a small dataset of 2789 tomato plant disease images, highly corrupted with class imbalance in 9 disease categories. Moreover, we evaluate our results in terms of different metrics and compare the quality of these results for distinct classes.

Comprehensive analysis of deep learning-based target classifiers in small and imbalanced active sonar datasets (소량 및 불균형 능동소나 데이터세트에 대한 딥러닝 기반 표적식별기의 종합적인 분석)

  • Geunhwan Kim;Youngsang Hwang;Sungjin Shin;Juho Kim;Soobok Hwang;Youngmin Choo
    • The Journal of the Acoustical Society of Korea
    • /
    • v.42 no.4
    • /
    • pp.329-344
    • /
    • 2023
  • In this study, we comprehensively analyze the generalization performance of various deep learning-based active sonar target classifiers when applied to small and imbalanced active sonar datasets. To generate the active sonar datasets, we use data from two different oceanic experiments conducted at different times and ocean. Each sample in the active sonar datasets is a time-frequency domain image, which is extracted from audio signal of contact after the detection process. For the comprehensive analysis, we utilize 22 Convolutional Neural Networks (CNN) models. Two datasets are used as train/validation datasets and test datasets, alternatively. To calculate the variance in the output of the target classifiers, the train/validation/test datasets are repeated 10 times. Hyperparameters for training are optimized using Bayesian optimization. The results demonstrate that shallow CNN models show superior robustness and generalization performance compared to most of deep CNN models. The results from this paper can serve as a valuable reference for future research directions in deep learning-based active sonar target classification.

Data anomaly detection for structural health monitoring using a combination network of GANomaly and CNN

  • Liu, Gaoyang;Niu, Yanbo;Zhao, Weijian;Duan, Yuanfeng;Shu, Jiangpeng
    • Smart Structures and Systems
    • /
    • v.29 no.1
    • /
    • pp.53-62
    • /
    • 2022
  • The deployment of advanced structural health monitoring (SHM) systems in large-scale civil structures collects large amounts of data. Note that these data may contain multiple types of anomalies (e.g., missing, minor, outlier, etc.) caused by harsh environment, sensor faults, transfer omission and other factors. These anomalies seriously affect the evaluation of structural performance. Therefore, the effective analysis and mining of SHM data is an extremely important task. Inspired by the deep learning paradigm, this study develops a novel generative adversarial network (GAN) and convolutional neural network (CNN)-based data anomaly detection approach for SHM. The framework of the proposed approach includes three modules : (a) A three-channel input is established based on fast Fourier transform (FFT) and Gramian angular field (GAF) method; (b) A GANomaly is introduced and trained to extract features from normal samples alone for class-imbalanced problems; (c) Based on the output of GANomaly, a CNN is employed to distinguish the types of anomalies. In addition, a dataset-oriented method (i.e., multistage sampling) is adopted to obtain the optimal sampling ratios between all different samples. The proposed approach is tested with acceleration data from an SHM system of a long-span bridge. The results show that the proposed approach has a higher accuracy in detecting the multi-pattern anomalies of SHM data.

Heuristic-Based Algorithm for Production Planning Considering Allocation Rate Conformance to Prevent Unstable Production Chain

  • Kim, Taehun;Ji, Bongjun;Cho, Hyunbo
    • Industrial Engineering and Management Systems
    • /
    • v.14 no.4
    • /
    • pp.413-419
    • /
    • 2015
  • This study solved the problem of unstable production chains by considering allocation rate conformance. We proposed two phased algorithm suitable for solving production planning that considers allocation rate conformance; the first phase was heuristic initial solution generation, and the second phase was tabu-search based solution improvement. By using three data sets which have different sizes of data and three different criteria, the results of proposed algorithm were compared with MIP results. The proposed algorithm showed the best production plan in terms of allocation rate conformance, and it was appropriate for other criteria; it solved the problem of unstable production chains by solving concentrated and unfair allocation.

A Method of Bank Telemarketing Customer Prediction based on Hybrid Sampling and Stacked Deep Networks (혼성 표본 추출과 적층 딥 네트워크에 기반한 은행 텔레마케팅 고객 예측 방법)

  • Lee, Hyunjin
    • Journal of Korea Society of Digital Industry and Information Management
    • /
    • v.15 no.3
    • /
    • pp.197-206
    • /
    • 2019
  • Telemarketing has been used in finance due to the reduction of offline channels. In order to select telemarketing target customers, various machine learning techniques have emerged to maximize the effect of minimum cost. However, there are problems that the class imbalance, which the number of marketing success customers is smaller than the number of failed customers, and the recall rate is lower than accuracy. In this paper, we propose a method that solve the imbalanced class problem and increase the recall rate to improve the efficiency. The hybrid sampling method is applied to balance the data in the class, and the stacked deep network is applied to improve the recall and precision as well as the accuracy. The proposed method is applied to actual bank telemarketing data. As a result of the comparison experiment, the accuracy, the recall, and the precision is improved higher than that of the conventional methods.