• Title/Summary/Keyword: CTGAN

Search results: 9

Resolving CTGAN-based data imbalance for commercialization of public technology (공공기술 사업화를 위한 CTGAN 기반 데이터 불균형 해소)

  • Hwang, Chul-Hyun
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.1 / pp.64-69 / 2022
  • Commercialization of public technology, the transfer of government-led scientific and technological innovation and R&D results to the private sector, is recognized as a key driver of economic growth. To stimulate technology transfer, various machine learning methods are being studied to identify success factors or to match public technologies that have high commercialization potential with the companies that need them. However, public technology commercialization data is tabular and imbalanced, with a large gap between the numbers of successes and failures, which limits machine learning performance. In this paper, we present a method that uses CTGAN to resolve the imbalance in tabular public technology data. To verify the effectiveness of the proposed method, we ran a comparative experiment against SMOTE, a statistical approach, using actual public technology commercialization data. In many experimental cases, the CTGAN-based approach was confirmed to predict successful public technology commercialization cases reliably.
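
The SMOTE baseline named in this abstract oversamples the minority class by interpolating between neighbouring minority rows. The sketch below is a minimal pure-Python illustration of that idea, not the paper's implementation; the function name `smote_sketch`, the toy rows, and the parameters `k` and `seed` are assumptions made for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Assumes `minority` holds at least two numeric rows of equal length.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority rows to the base row, excluding the base itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, partner)))
    return synthetic


# Hypothetical toy data: four "success" rows in an imbalanced table.
minority_rows = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_rows = smote_sketch(minority_rows, n_new=6)
print(len(new_rows), "synthetic minority rows")
```

Every synthetic row lies on a line segment between two real minority rows, which is the main contrast with a generative model such as CTGAN that samples from a learned distribution.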

A Study on the Optimization of Data Augmentation Ratio using CTGAN (CTGAN기반 데이터 증강 비율 최적화 연구)

  • Da-Hun Seong;Yujin Lim
    • Proceedings of the Korea Information Processing Society Conference / 2023.11a / pp.327-330 / 2023
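  • As the use of machine learning and deep learning models grows rapidly, securing sufficient data is becoming increasingly important. Data augmentation with generative models is therefore attracting attention, but analysis of training performance when augmented data is used remains insufficient. This study investigates the usefulness of synthetic data at different augmentation ratios under several data augmentation scenarios. We focus on augmenting tabular data and use CTGAN, a deep learning model that performs well at synthesizing tabular data. Considering two different augmentation scenarios in the experiments, we confirmed that in both scenarios the synthetic data was useful up to the augmentation ratio set in the experiments.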

Study of oversampling algorithms for soil classifications by field velocity resistivity probe

  • Lee, Jong-Sub;Park, Junghee;Kim, Jongchan;Yoon, Hyung-Koo
    • Geomechanics and Engineering / v.30 no.3 / pp.247-258 / 2022
  • A field velocity resistivity probe (FVRP) can measure compressional waves, shear waves, and electrical resistivity in boreholes. The objective of this study is to classify soils with machine learning techniques using the elastic wave velocities and electrical resistivity measured by the FVRP. Field and laboratory tests are performed, and the measured values are used as input variables to classify silt sand, sand, silty clay, and clay-sand mixture layers. Four classifiers, k-nearest neighbors (KNN), naive Bayes (NB), random forest (RF), and support vector machine (SVM), are selected, their hyperparameters are optimized, and their accuracy is evaluated. The accuracies are 0.76, 0.91, 0.94, and 0.88 for the KNN, NB, RF, and SVM algorithms, respectively. To increase the amount of data for each soil layer and overcome the imbalance in the dataset, the synthetic minority oversampling technique (SMOTE) and the conditional tabular generative adversarial network (CTGAN) are applied. CTGAN improves the accuracy of the KNN, NB, RF, and SVM algorithms. The results demonstrate that soil layers can be classified from the three kinds of data measured by the FVRP using machine learning algorithms.

A Comparative Study on Data Augmentation Using Generative Models for Robust Solar Irradiance Prediction

  • Jinyeong Oh;Jimin Lee;Daesungjin Kim;Bo-Young Kim;Jihoon Moon
    • Journal of the Korea Society of Computer and Information / v.28 no.11 / pp.29-42 / 2023
  • In this paper, we propose a method to enhance the prediction accuracy of solar irradiance for three major South Korean cities: Seoul, Busan, and Incheon. Our method entails the development of five generative models (vanilla GAN, CTGAN, Copula GAN, WGANGP, and TVAE) to generate independent variables that mimic the patterns of existing training data. To mitigate the bias in model training, we derive values for the dependent variables using random forests and deep neural networks, enriching the training datasets. These datasets are integrated with existing data to form comprehensive solar irradiance prediction models. The experiments revealed that models trained on the augmented datasets performed significantly better than those trained solely on the original data. Specifically, CTGAN showed outstanding results due to its sophisticated mechanism for handling the intricacies of multivariate data relationships, ensuring that the generated data are diverse and closely aligned with the real-world variability of solar irradiance. The proposed method is expected to address the issue of data scarcity by augmenting the training data with high-quality synthetic data, thereby contributing to the operation of solar power systems for sustainable development.

Conditional Variational Autoencoder-based Generative Model for Gene Expression Data Augmentation (유전자 발현량 데이터 증대를 위한 Conditional VAE 기반 생성 모델)

  • Hyunsu Bong;Minsik Oh
    • Journal of Broadcast Engineering / v.28 no.3 / pp.275-284 / 2023
  • Gene expression data can be utilized in various studies, including the prediction of disease prognosis. However, collecting enough data is difficult because of cost constraints. In this paper, we propose a gene expression data generation model based on a Conditional Variational Autoencoder. Our results demonstrate that the proposed model generates synthetic data of superior quality compared to other state-of-the-art models for gene expression data generation, namely a model based on the Wasserstein Generative Adversarial Network with Gradient Penalty and the structured data generation models CTGAN and TVAE.

Generating and Validating Synthetic Training Data for Predicting Bankruptcy of Individual Businesses

  • Hong, Dong-Suk;Baik, Cheol
    • Journal of information and communication convergence engineering / v.19 no.4 / pp.228-233 / 2021
  • In this study, we analyze the credit information (loans, delinquency information, etc.) of individual business owners and use a partial synthetic training technique to generate a large volume of training data for building a bankruptcy prediction model. We then evaluate the prediction performance obtained with the newly generated data against that obtained with the actual data. In the experiments (a logistic regression task), using training data generated with conditional tabular generative adversarial networks (CTGAN) improves recall by a factor of 1.75 compared with using the actual data. The probability that the actual and generated data are sampled from an identical distribution is verified to be well above 80%. Providing artificial intelligence training data through data synthesis for credit rating and default risk prediction of individual businesses, fields in which research has been relatively inactive, should encourage further in-depth research on such methods.

Credit Card Fraud Detection Based on SHAP Considering Time Sequences (시간대를 고려한 SHAP 기반의 신용카드 이상 거래 탐지)

  • Soyeon Yang;Yujin Lim
    • Proceedings of the Korea Information Processing Society Conference / 2023.05a / pp.370-372 / 2023
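  • Fraudulent credit card use causes enormous losses to the credit and assets of customers and companies. Financial institutions have therefore introduced anomalous financial transaction detection systems, but because these systems continuously monitor for anomalous transactions, maintaining them is costly. This paper proposes a credit card anomalous transaction detection algorithm that saves computing resources while also improving performance. We use CTGAN to partially ease the imbalance between normal and anomalous transactions and use SHAP, an XAI technique, to select meaningful attributes. On this basis, an LSTM Autoencoder is used to detect anomalous data. The results confirm that the proposed algorithm outperforms traditional unsupervised learning techniques.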

A Study on the Prediction Model for Bioactive Components of Cnidium officinale Makino according to Climate Change using Machine Learning (머신러닝을 이용한 기후변화에 따른 천궁 생리 활성 성분 예측 모델 연구)

  • Hyunjo Lee;Hyun Jung Koo;Kyeong Cheol Lee;Won-Kyun Joo;Cheol-Joo Chae
    • Smart Media Journal / v.12 no.10 / pp.93-101 / 2023
  • Climate change has emerged as a global problem, with frequent temperature increases, droughts, and floods, and it is predicted to have a great impact on the characteristics and productivity of crops. Cnidium officinale is used not only as a traditional herbal medicine but also as a raw material for various industrial products such as health functional foods, natural medicines, and living materials, but its productivity is decreasing due to threats such as continuous cropping damage and climate change. Therefore, this paper proposes a model that can predict the physiologically active ingredient index of Cnidium officinale, a representative medicinal crop vulnerable to climate change, under climate change scenarios. Data was first augmented using the CTGAN algorithm to solve the problem of data imbalance in the collected environmental information, physiological responses, and physiologically active ingredient information. Column Shape and Column Pair Trends were used to measure the quality of the augmented data, and an overall quality of 88% was achieved on average. In addition, five models (RF, SVR, XGBoost, AdaBoost, and LightGBM) were trained on the augmented data to predict phenol and flavonoid content separately for the above-ground and underground parts. In the model evaluation, the XGBoost model showed the best performance in predicting the physiologically active ingredients of Cnidium officinale and was confirmed to be about twice as accurate as the SVR model.
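
The abstract does not say how Column Shape and Column Pair Trends were computed. The names resemble the two properties of the SDMetrics quality report, so the sketch below assumes that reading: for numeric columns, a column-shape score of one minus the two-sample Kolmogorov-Smirnov statistic, a pair-trend score based on the difference between real and synthetic Pearson correlations, and an overall quality equal to the mean of the two. All data values and function names are hypothetical.

```python
def ks_complement(real, synth):
    """Column shape for a numeric column: 1 - two-sample KS statistic."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    points = sorted(set(real) | set(synth))
    ks = max(abs(ecdf(real, x) - ecdf(synth, x)) for x in points)
    return 1.0 - ks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def correlation_similarity(real_a, real_b, synth_a, synth_b):
    """Pair trend for two numeric columns: 1 - |corr_real - corr_synth| / 2."""
    return 1.0 - abs(pearson(real_a, real_b) - pearson(synth_a, synth_b)) / 2.0


# Hypothetical toy columns: temperature and phenol content.
real_temp = [10.0, 12.0, 14.0, 16.0, 18.0]
real_phenol = [1.0, 1.3, 1.5, 1.9, 2.1]
synth_temp = [10.5, 11.5, 14.5, 15.5, 18.5]
synth_phenol = [1.1, 1.2, 1.6, 1.8, 2.2]

shape = (ks_complement(real_temp, synth_temp)
         + ks_complement(real_phenol, synth_phenol)) / 2
trend = correlation_similarity(real_temp, real_phenol, synth_temp, synth_phenol)
overall = (shape + trend) / 2  # mean of the two property scores
print(f"shape={shape:.2f} trend={trend:.2f} overall={overall:.2f}")
```

Both scores range from 0 to 1, and identical real and synthetic columns score 1 on each, so an overall figure such as 88% would be read as the average of the two property scores under this assumption.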

Automatic Augmentation Technique of an Autoencoder-based Numerical Training Data (오토인코더 기반 수치형 학습데이터의 자동 증강 기법)

  • Jeong, Ju-Eun;Kim, Han-Joon;Chun, Jong-Hoon
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.22 no.5 / pp.75-86 / 2022
  • This study aims to solve the class imbalance problem in numerical data using a deep learning-based Variational AutoEncoder and to improve the performance of the learning model by augmenting the training data. We propose 'D-VAE' to artificially increase the number of records in a given table. The main feature of the proposed technique is that it optimizes the data through discretization and feature selection during preprocessing. In the discretization step, values are grouped with K-means and then converted into one-hot vectors by one-hot encoding. Then, for memory efficiency, sample data are generated with the Variational AutoEncoder using only the features that RFECV, a feature selection technique, identifies as helpful for prediction. To verify the performance of the proposed model, we demonstrate its validity through experiments at different data augmentation ratios.
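
The discretization step described in this abstract (group numeric values with K-means, then one-hot encode the group label) can be sketched in pure Python as follows. This is only an illustration under stated assumptions: the one-dimensional K-means, its evenly spread starting centres, the choice of `k=3`, and the toy column are not taken from the paper, and the RFECV and Variational AutoEncoder stages are not shown.

```python
def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on one numeric column; returns (centres, labels)."""
    ordered = sorted(values)
    # Deterministic start: centres spread evenly over the sorted values.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]

    def assign():
        # Label each value with the index of its nearest centre.
        return [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old centre if a cluster is empty
                centres[c] = sum(members) / len(members)
    return centres, assign()


def one_hot(labels, k):
    """Turn each cluster label into a length-k indicator vector."""
    return [[1 if lab == c else 0 for c in range(k)] for lab in labels]


# Hypothetical numeric column with three visible groups.
column = [0.9, 1.1, 1.0, 5.2, 4.8, 5.0, 9.7, 10.1, 10.3]
centres, labels = kmeans_1d(column, k=3)
encoded = one_hot(labels, k=3)
print(labels)
print(encoded[0], encoded[3], encoded[6])
```

Each numeric value is replaced by an indicator vector for its cluster, so a generative model downstream sees a categorical input rather than a raw continuous one.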