• Title/Summary/Keyword: Imbalanced Dataset

Study on Cochlodinium polykrikoides Red tide Prediction using Deep Neural Network under Imbalanced Data (심층신경망을 활용한 Cochlodinium polykrikoides 적조 발생 예측 연구)

  • Bak, Su-Ho;Jeong, Min-Ji;Hwang, Do-Hyun;Enkhjargal, Unuzaya;Kim, Na-Kyeong;Yoon, Hong-Joo
    • The Journal of the Korea institute of electronic communication sciences / v.14 no.6 / pp.1161-1170 / 2019
  • In this study, we propose a model for predicting the occurrence of Cochlodinium polykrikoides red tide using deep neural networks. A deep neural network with eight hidden layers was constructed for this purpose. Fifty-nine marine and meteorological factors were extracted from satellite reanalysis data and meteorological model data and used to train the network. Red tide occurrences account for only a very small share of the dataset compared with non-occurrence cases, which creates an imbalanced data problem. To address this, we applied oversampling with noise-based data augmentation. Evaluated on test data, the model achieved an accuracy of about 97%.
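
The oversampling-with-noise idea mentioned in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `oversample_with_noise`, the Gaussian noise model, the `sigma` value, and the toy feature values are all assumptions made for the example.

```python
import random

def oversample_with_noise(minority, n_new, sigma=0.01, seed=0):
    """Create n_new synthetic rows by copying random minority rows
    and adding small Gaussian noise to every feature (assumed noise model)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)  # pick an existing minority sample
        synthetic.append([x + rng.gauss(0.0, sigma) for x in base])
    return synthetic

# Hypothetical toy data: three minority (red tide) samples with two features each.
red_tide = [[24.1, 33.0], [25.3, 32.8], [24.8, 33.2]]
augmented = red_tide + oversample_with_noise(red_tide, n_new=7)
print(len(augmented))  # 10
```

In practice the noise scale would need to be chosen per feature, since the 59 factors are unlikely to share one unit or range.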

Attention Gated FC-DenseNet for Extracting Crop Cultivation Area by Multispectral Satellite Imagery (다중분광밴드 위성영상의 작물재배지역 추출을 위한 Attention Gated FC-DenseNet)

  • Seong, Seon-kyeong;Mo, Jun-sang;Na, Sang-il;Choi, Jae-wan
    • Korean Journal of Remote Sensing / v.37 no.5_1 / pp.1061-1070 / 2021
  • In this manuscript, we sought to improve the performance of FC-DenseNet for classifying cropping areas by applying an attention gate. The attention gate module can facilitate the training of a deep learning model and improve its performance by injecting spatial/spectral weights into each feature map. Crop classification was performed for onion and garlic regions using the proposed deep learning model, in which an attention gate is added to the skip connections of FC-DenseNet. Training data were produced from various PlanetScope satellite images, and preprocessing was applied to minimize the problem of the imbalanced training dataset. The crop classification results verified that the proposed model classifies the onion and garlic regions more effectively than the existing FC-DenseNet algorithm.

Prediction for Periodontal Disease using Gene Expression Profile Data based on Machine Learning (기계학습 기반 유전자 발현 데이터를 이용한 치주질환 예측)

  • Rhee, Je-Keun
    • Journal of the Korea Institute of Information and Communication Engineering / v.23 no.8 / pp.903-909 / 2019
  • Periodontal disease is observed in many adults. However, its molecular mechanism and how to treat it at the molecular level are not clearly understood. Here, we investigated the molecular differences between periodontal disease and normal controls using gene expression data. In particular, we checked whether periodontal disease and normal tissues could be classified by machine learning algorithms using gene expression data. We also identified the differentially expressed genes and their functions. As a result, the periodontal disease and normal control samples were clearly clustered. In addition, by applying several classification algorithms, such as decision trees, random forests, and support vector machines, the two sample groups were classified with high accuracy, sensitivity, and specificity, even though the dataset was imbalanced. Finally, we found that genes related to inflammation and immune response usually show distinct patterns between the two classes.
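
Reporting sensitivity and specificity alongside accuracy matters on imbalanced data, because accuracy alone can look good while one class is handled poorly. The sketch below computes the three metrics from predictions; the helper name `binary_metrics` and the toy labels are hypothetical and are not taken from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity (recall on positives), and specificity
    (recall on negatives) for a binary classification result."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }

# Hypothetical imbalanced toy set: 8 disease samples (1), 2 controls (0).
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 8 + [1, 0]  # one control is misclassified as disease
metrics = binary_metrics(y_true, y_pred)
print(metrics)  # accuracy 0.9, sensitivity 1.0, specificity 0.5
```

Here a 90% accuracy hides the fact that half of the minority class is misclassified, which is why the abstract's mention of all three metrics is informative.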

Multivariate Outlier Removing for the Risk Prediction of Gas Leakage based Methane Gas (메탄 가스 기반 가스 누출 위험 예측을 위한 다변량 특이치 제거)

  • Dashdondov, Khongorzul;Kim, Mi-Hye
    • Journal of the Korea Convergence Society / v.11 no.12 / pp.23-30 / 2020
  • In this study, we analyzed the relationship between natural gas (NG) data and gas-related environmental elements using machine learning algorithms to predict the level of gas leakage risk without directly measuring gas leakage data. The study was based on open data provided by the server using the IoT-based remote control Picarro gas sensor specification. When natural gas leaks into the air, it poses a serious problem for air pollution, the environment, and health. The proposed method is a multivariate outlier removal method based on Random Forest (RF) classification for predicting the risk of NG leaks. After unsupervised k-means clustering, the experimental dataset was imbalanced. We therefore focus on how well the proposed models predict the medium- and high-risk classes. We compared the receiver operating characteristic (ROC) curve, accuracy, area under the ROC curve (AUC), and mean standard error (MSE) of each classification model. In our experiments, MOL_RF achieved an accuracy of 99.71%, an AUC of 99.57%, and an MSE of 0.0016.

Age and Gender Classification with Small Scale CNN (소규모 합성곱 신경망을 사용한 연령 및 성별 분류)

  • Jamoliddin, Uraimov;Yoo, Jae Hung
    • The Journal of the Korea institute of electronic communication sciences / v.17 no.1 / pp.99-104 / 2022
  • Artificial intelligence is becoming a crucial part of our lives thanks to its considerable benefits. Machines outperform humans in recognizing objects in images, particularly in classifying people into the correct age and gender groups. In this respect, age and gender classification has been one of the hot topics among computer vision researchers in recent decades. Deep Convolutional Neural Network (CNN) models have achieved state-of-the-art performance. However, most CNN-based architectures are very complex, with several dozens of training parameters, so they require considerable computation time and resources. For this reason, we propose a new CNN-based classification algorithm with significantly fewer training parameters and a shorter training time than existing methods. Despite its lower complexity, our model shows better accuracy for age and gender classification on the UTKFace dataset.

A hierarchical semantic segmentation framework for computer vision-based bridge damage detection

  • Jingxiao Liu;Yujie Wei;Bingqing Chen;Hae Young Noh
    • Smart Structures and Systems / v.31 no.4 / pp.325-334 / 2023
  • Computer vision-based damage detection enables non-contact, efficient, and low-cost bridge health monitoring, which reduces the need for labor-intensive manual inspection or for a large number of on-site sensing instruments. By leveraging recent semantic segmentation approaches, we can detect regions of critical structural components and identify damage at the pixel level in images. However, existing methods perform poorly when detecting small and thin damage (e.g., cracks); the problem is exacerbated by imbalanced samples. To this end, we incorporate domain knowledge to introduce a hierarchical semantic segmentation framework that imposes a hierarchical semantic relationship between component categories and damage types. For instance, certain types of concrete cracks are only present on bridge columns, and therefore the non-column region may be masked out when detecting such damage. In this way, the damage detection model focuses on extracting features from relevant structural components and avoids those from irrelevant regions. We also utilize multi-scale augmentation to preserve the contextual information of each image without losing the ability to handle small and/or thin damage. In addition, our framework employs importance sampling, where images with rare components are sampled more often, to address sample imbalance. We evaluated our framework on a public synthetic dataset that consists of 2,000 railway bridges. Our framework achieves a 0.836 mean intersection over union (IoU) for structural component segmentation and a 0.483 mean IoU for damage segmentation. These results represent improvements of 5% and 18% for the structural component segmentation and damage segmentation tasks, respectively, compared to the best-performing baseline model.
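
The importance sampling step, in which images containing rare components are drawn more often, could look something like the sketch below. The weighting rule (the inverse frequency of the rarest component in each image) is an assumption chosen for illustration, as are the function name `sampling_weights` and the component labels; the abstract does not specify the paper's exact scheme.

```python
import random
from collections import Counter

def sampling_weights(image_components):
    """Give each image a sampling probability proportional to the inverse
    frequency of its rarest component (an assumed weighting rule).
    Each image must list at least one component."""
    freq = Counter(c for comps in image_components for c in comps)
    raw = [max(1.0 / freq[c] for c in comps) for comps in image_components]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical dataset: "column" appears in only one of four images.
images = [{"deck"}, {"deck"}, {"deck", "column"}, {"deck"}]
weights = sampling_weights(images)

# Draw a training batch of image indices; index 2 is favored.
rng = random.Random(0)
batch = rng.choices(range(len(images)), weights=weights, k=8)
```

With these toy counts, the image containing the rare component receives a weight of 4/7 while each of the others receives 1/7.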

Model Interpretation through LIME and SHAP Model Sharing (LIME과 SHAP 모델 공유에 의한 모델 해석)

  • Yong-Gil Kim
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.24 no.2 / pp.177-184 / 2024
  • As data grows rapidly, we use all kinds of complex ensemble and deep learning algorithms to achieve the highest accuracy. It is sometimes questionable how these models predict, classify, recognize, and track unknown data. Explaining this behavior has been, and will remain, a goal of intensive research and development in the data science community. A variety of factors, such as lack of data, imbalanced data, and biased data, can affect the decisions rendered by learning models. Many methods are gaining traction for such interpretation. LIME and SHAP, two state-of-the-art open-source explainability techniques, are now commonly used. However, their outputs can differ. In this context, this study introduces a technique that couples LIME and SHAP and demonstrates its analysis possibilities on the decisions made by LightGBM and Keras models when classifying transactions as fraudulent on the IEEE CIS dataset.

Experimental Comparison of Network Intrusion Detection Models Solving Imbalanced Data Problem (데이터의 불균형성을 제거한 네트워크 침입 탐지 모델 비교 분석)

  • Lee, Jong-Hwa;Bang, Jiwon;Kim, Jong-Wouk;Choi, Mi-Jung
    • KNOM Review / v.23 no.2 / pp.18-28 / 2020
  • With the development of the virtual community, the benefits that IT provides to people in fields such as healthcare, industry, communication, and culture are increasing, and quality of life is improving. At the same time, various malicious attacks target this developed network environment. Firewalls and intrusion detection systems exist to detect these attacks in advance, but they are limited in detecting malicious attacks that evolve day by day. To solve this problem, intrusion detection research using machine learning is being actively conducted, but false positives and false negatives occur because of imbalance in the training dataset. In this paper, a Random Oversampling method is used to address the imbalance problem of the UNSW-NB15 dataset used for network intrusion detection. Through experiments, we compared and analyzed the accuracy, precision, recall, F1-score, training and prediction time, and hardware resource consumption of the models. Building on this study of the Random Oversampling method, we intend to pursue more efficient network intrusion detection models using other methods and high-performance models that can solve the imbalanced data problem.
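
Random oversampling, as named in this abstract, duplicates minority-class samples until the classes are balanced. The following is a minimal standard-library sketch under that general definition; the function name `random_oversample` and the toy labels are hypothetical, and the paper's own tooling is not stated in the abstract.

```python
import random
from collections import Counter, defaultdict

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples of each smaller class until every
    class has as many samples as the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out

# Hypothetical toy traffic records: 8 normal, 2 attack.
X = [[i] for i in range(10)]
y = ["normal"] * 8 + ["attack"] * 2
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have 8 samples
```

Oversampling is normally applied to the training split only, so that duplicated samples do not leak into the evaluation data.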

The Development of Park Analysis Indicators and Current Status: A Case Study of Daejeon Metropolitan City (공원 분석 지표 개발 및 현황 분석: 대전광역시를 중심으로)

  • Hwang, Jae-Yeon;Gwak, Seung-Yeon;Kim, Sang-Kyu;Park, Min-Ju
    • Land and Housing Review / v.13 no.1 / pp.99-112 / 2022
  • Securing urban parks and enhancing their accessibility is becoming increasingly important because of irrational residential developments and apartment construction. Accordingly, Daejeon Metropolitan City has carried out urban park management projects to improve the quality of parks and create new ones. The city generates and manages park data by administrative district for management purposes. However, these datasets take different forms in each administrative district. This study integrates the park data generated by Daejeon's administrative districts into a single format and produces geographic information data containing the area of each park for analysis. The analysis results show that urban parks are severely imbalanced across administrative districts, requiring new policy measures. In addition, by normalizing the park analysis results and then ranking them, this study compares them in detail with the actual park information to confirm the soundness of the dataset. The analysis results provide implications for improving the management of urban parks. This study proposes integrated datasets, and their continued management in each administrative district, that include essential data capturing objective information about the parks along with park evaluation indicators based on previous studies.
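
The normalize-then-rank step described in this abstract can be illustrated with a short sketch. Min-max scaling is an assumption here, since the abstract does not say which normalization was used; the district names and indicator values are hypothetical.

```python
def min_max_normalize(values):
    """Scale values linearly to the range [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def rank_descending(scores):
    """Return 1-based ranks, where rank 1 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

# Hypothetical indicator: park area per resident for three districts.
districts = ["District A", "District B", "District C"]
park_area = [12.0, 4.0, 8.0]

normalized = min_max_normalize(park_area)
ranks = rank_descending(normalized)
for name, score, rank in zip(districts, normalized, ranks):
    print(name, score, rank)
```

Normalizing first puts indicators with different units on a common scale, so that several indicators can be compared or combined before ranking.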