• Title/Summary/Keyword: Imbalanced data


Conditional Generative Adversarial Network based Collaborative Filtering Recommendation System (Conditional Generative Adversarial Network(CGAN) 기반 협업 필터링 추천 시스템)

  • Kang, Soyi;Shin, Kyung-shik
    • Journal of Intelligence and Information Systems / v.27 no.3 / pp.157-173 / 2021
  • With the development of information technology, the amount of available information increases daily. However, having access to so much information makes it difficult for users to find the information they seek. Users want a visualized system that reduces information retrieval and learning time, saving them from personally reading and judging all available information. As a result, recommendation systems are an increasingly important technology that is essential to business. Collaborative filtering is used in various fields with excellent performance because recommendations are made based on similar user interests and preferences. However, limitations do exist. Sparsity, which occurs when user-item preference information is insufficient, is the main limitation of collaborative filtering. The ratings in the user-item matrix may be distorted depending on the popularity of the product, or there may be new users who have not yet rated any items. The lack of historical data to identify consumer preferences is referred to as data sparsity, and various methods have been studied to address these problems. However, most attempts to solve the sparsity problem are not optimal because they can only be applied when additional data such as users' personal information, social networks, or characteristics of items are included. Another problem is that real-world rating data are mostly biased toward high scores, resulting in severe imbalances. One cause of this imbalanced distribution is purchasing bias, in which users who rate products highly are the ones who purchase them, whereas those who would rate them low are less likely to purchase and thus do not leave negative product reviews. Because of this, reviews by users who purchase products are more likely to be positive than most users' actual preferences.
Therefore, models trained on actual rating data over-learn the frequently occurring classes because of this bias, which gives a distorted picture of the market. Applying collaborative filtering to such imbalanced data leads to poor recommendation performance due to excessive learning of the biased classes. Traditional oversampling techniques for this problem are likely to cause overfitting because they repeat the same data, which acts as noise in learning and reduces recommendation performance. In addition, most existing pre-processing methods for data imbalance are designed and used for binary classes. Binary class imbalance techniques are difficult to apply to multi-class problems because they cannot model multi-class characteristics such as objects at cross-class boundaries or objects overlapping multiple classes. To solve this problem, research has been conducted on converting multi-class problems into binary class problems. However, simplifying multi-class problems can cause classification errors when the results are combined with those of classifiers learned from other sub-problems, resulting in the loss of important information about relationships beyond the selected items. Therefore, it is necessary to develop more effective methods to address multi-class imbalance problems. We propose a collaborative filtering model using CGAN to generate realistic virtual data to populate the empty user-item matrix. The conditional vector y is used to identify the distributions of minority classes and generate data reflecting their characteristics. Collaborative filtering then maximizes the performance of the recommendation system via hyperparameter tuning. This process should improve the accuracy of the model by addressing the sparsity problem of collaborative filtering implementations while mitigating data imbalances arising from real data. Our model shows better recommendation performance than existing oversampling techniques and than the original sparse real-world data.
SMOTE, Borderline SMOTE, SVM-SMOTE, ADASYN, and GAN were used as comparative models, and our model demonstrated the highest prediction accuracy on the RMSE and MAE evaluation metrics. This study suggests that deep learning-based oversampling can further refine the performance of recommendation systems on actual data and can be used to build business recommendation systems.
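
The abstract above compares models on RMSE and MAE. As a minimal sketch of how these two error measures are computed on predicted ratings, the following uses made-up ratings on a 1-5 scale; the function names and all values are illustrative assumptions, not data from the paper.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large rating errors more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: the average size of the rating error."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)

# Hypothetical ratings (illustrative only, not from the paper).
actual = [5, 4, 5, 1, 3]
predicted = [4.5, 4.0, 4.0, 2.5, 3.0]

print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))
```

In this toy sample the single large miss on the low rating (1 predicted as 2.5) raises RMSE more than MAE, which is one reason both measures are usually reported together when low ratings are rare.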

Experimental Comparison of Network Intrusion Detection Models Solving Imbalanced Data Problem (데이터의 불균형성을 제거한 네트워크 침입 탐지 모델 비교 분석)

  • Lee, Jong-Hwa;Bang, Jiwon;Kim, Jong-Wouk;Choi, Mi-Jung
    • KNOM Review / v.23 no.2 / pp.18-28 / 2020
  • With the development of the virtual community, the benefits that IT provides to people in fields such as healthcare, industry, communication, and culture are increasing, and the quality of life is also improving. Accordingly, various malicious attacks now target the developed network environment. Firewalls and intrusion detection systems exist to detect these attacks in advance, but they are limited in detecting malicious attacks that evolve day by day. To solve this problem, intrusion detection research using machine learning is being actively conducted, but false positives and false negatives occur due to the imbalance of the training dataset. In this paper, a Random Oversampling method is used to solve the imbalance problem of the UNSW-NB15 dataset used for network intrusion detection. Through experiments, we compared and analyzed the accuracy, precision, recall, F1-score, training and prediction time, and hardware resource consumption of the models. Building on this study of the Random Oversampling method, we will pursue more efficient network intrusion detection models using other methods and high-performance models that can solve the imbalanced data problem.
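
Random oversampling, as named in the abstract above, can be sketched in a few lines: minority-class rows are duplicated at random until every class is as large as the largest one. The function name, the toy records, and the class labels below are hypothetical; the paper's actual preprocessing of UNSW-NB15 is not described here.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing rows with replacement from the same class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical toy records: 8 "normal" flows and 2 "attack" flows.
rows = [[i, i * 2] for i in range(10)]
labels = ["normal"] * 8 + ["attack"] * 2

new_rows, new_labels = random_oversample(rows, labels)
counts = Counter(new_labels)
print(counts)
```

Oversampling is normally applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.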

Ensemble Learning-Based Prediction of Good Sellers in Overseas Sales of Domestic Books and Keyword Analysis of Reviews of the Good Sellers (앙상블 학습 기반 국내 도서의 해외 판매 굿셀러 예측 및 굿셀러 리뷰 키워드 분석)

  • Do Young Kim;Na Yeon Kim;Hyon Hee Kim
    • KIPS Transactions on Software and Data Engineering / v.12 no.4 / pp.173-178 / 2023
  • As Korean literature spreads around the world, its position in the overseas publishing market has become important. As demand in the overseas publishing market continues to grow, it is essential to predict future book sales and analyze the characteristics of books that have been highly favored by overseas readers in the past. In this study, we propose an ensemble learning-based prediction model and analyze the characteristics of good sellers, defined as books published overseas over the past 5 years with cumulative sales of more than 5,000 copies. We applied five ensemble learning models, i.e., XGBoost, Gradient Boosting, AdaBoost, LightGBM, and Random Forest, and compared them with other machine learning algorithms, i.e., Support Vector Machine, Logistic Regression, and Deep Learning. Our experimental results showed that the ensemble algorithms outperform the other approaches in handling imbalanced data. In particular, the LightGBM model obtained an AUC value of 99.86%, the best prediction performance. Among the features used for prediction, the most important is the author's number of overseas publications, and the second most important is publication in the countries with the largest publishing markets. The number of evaluation participants is also an important feature. In addition, text mining was performed on reviews of the four best-selling books among the good sellers. Many reviews showed interest in stories, characters, and writers, and support for translation appears to be needed, as the keyword "translation" appears frequently in low-rated reviews.

A Classification Model for Customs Clearance Inspection Results of Imported Aquatic Products Using Machine Learning Techniques (머신러닝 기법을 활용한 수입 수산물 통관검사결과 분류 모델)

  • Ji Seong Eom;Lee Kyung Hee;Wan-Sup Cho
    • The Journal of Bigdata / v.8 no.1 / pp.157-165 / 2023
  • Seafood is a major source of protein in many countries, and its consumption is increasing. In Korea, seafood consumption is increasing while the self-sufficiency rate is decreasing, so safety management is becoming more important as the amount of imported seafood grows. Hundreds of species of aquatic products are imported into Korea from over 110 countries, and there is a limit to relying only on the experience of inspectors for safety management of imported aquatic products. In this study, a data-driven model that predicts the customs inspection results of imported aquatic products is developed: a machine learning classification model that determines the non-conformity of aquatic products when an import declaration is submitted. In customs inspections of imported aquatic products, the non-conformity rate is less than 1%, so the data are highly imbalanced. Therefore, sampling methods that can compensate for this characteristic were compared, and a preprocessing method that makes the classification results interpretable was applied. Among various machine learning-based classification models, Random Forest and XGBoost showed good performance. The model that predicts both conformity and non-conformity well is the basic Random Forest model with ADASYN and one-hot encoding applied, which has an accuracy of 99.88%, precision of 99.87%, recall of 99.89%, and AUC of 99.88%. XGBoost is the most stable model, with all indicators exceeding 90% regardless of oversampling and encoding type.
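
With a non-conformity rate near 1%, accuracy alone says little about a classifier. The sketch below, built on a hypothetical sample of 1,000 declarations (10 non-conforming), shows a trivial model that always answers "conforming": it reaches 99% accuracy while detecting no non-conforming shipment. All names and numbers are illustrative assumptions, not figures from the paper.

```python
def accuracy(actual, predicted):
    """Share of predictions that match the true label."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)

def recall(actual, predicted, positive=1):
    """Share of true positives that the model actually flags."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical sample: 1 = non-conforming, 0 = conforming.
actual = [1] * 10 + [0] * 990
always_conforming = [0] * len(actual)  # trivial baseline, never flags anything

acc = accuracy(actual, always_conforming)
rec = recall(actual, always_conforming)
print("accuracy:", acc, "recall on non-conforming:", rec)
```

This is why results on such data are usually read through precision, recall, and AUC together with accuracy, as the abstract does.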

VRIFA: A Prediction and Nonlinear SVM Visualization Tool using LRBF kernel and Nomogram (VRIFA: LRBF 커널과 Nomogram을 이용한 예측 및 비선형 SVM 시각화도구)

  • Kim, Sung-Chul;Yu, Hwan-Jo
    • Journal of Korea Multimedia Society / v.13 no.5 / pp.722-729 / 2010
  • Prediction problems are common in medical domains. For example, computer-aided diagnosis or prognosis is a key component of a CDSS (Clinical Decision Support System). SVMs with nonlinear kernels, such as RBF kernels, have shown superior accuracy in prediction problems. However, physicians do not prefer them for medical prediction problems because nonlinear SVMs are difficult to visualize, which makes it hard to provide an intuitive interpretation of prediction results. The nomogram was proposed to visualize SVM classification models, but it cannot visualize nonlinear SVM models. The Localized Radial Basis Function (LRBF) kernel was proposed; it shows accuracy comparable to the RBF kernel and is easier to interpret because it can be linearly decomposed. This paper presents a new tool named VRIFA, which integrates the nomogram and the LRBF kernel to provide users with an interactive visualization of nonlinear SVM models. VRIFA visualizes the internal structure of nonlinear SVM models, showing the effect of each feature, the magnitude of the effect, and the change in the prediction output. VRIFA also performs nomogram-based feature selection while training a model in order to remove noisy or redundant features and improve prediction accuracy. The area under the ROC curve (AUC) can be used to evaluate the prediction result when the data set is highly imbalanced. The tool can be used by biomedical researchers for computer-aided diagnosis and risk factor analysis for diseases.
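
The AUC mentioned above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, which is why it does not depend on the class ratio. A minimal sketch of that pairwise definition follows; the scores and labels are made up, and this is not the VRIFA implementation.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs in which the positive
    case has the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical scores: 2 positive cases among 8 negative ones.
scores = [0.9, 0.4, 0.8, 0.35, 0.3, 0.2, 0.6, 0.1, 0.05, 0.5]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print("AUC:", auc(scores, labels))
```

The double loop is quadratic in the number of cases and is meant only to show the definition; rank-based formulas give the same value more efficiently on large data sets.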

Recent Changes in Sex Ratio at Birth and Simulations on Sex-Selective Reproductive Behavior: With a Special Focus on Youngnam Region (출생성비의 최근 변화와 시뮬레이션을 통한 성선별 출산행위의 추정: 영남 지역을 중심으로)

  • Kim, Doo-Sub
    • Korea journal of population studies / v.34 no.1 / pp.159-178 / 2011
  • Korea has been widely recognized as the most successful country for reversal of the rise in sex ratio at birth (from the mid-1980s to the mid-1990s) in a short period of time. However, unusually high sex ratios at birth are still observed in most regions as parity increases. Given that imbalanced sex ratios at high birth orders are mostly due to son-selective abortion, it still remains questionable whether son-selective reproductive behavior has vanished in Korea. The main purpose of this study is to analyze the pattern of changing trends and socioeconomic differentials in sex ratio at birth. Micro-data from birth registration for 2009 are utilized. Attention is focused on analyzing sex ratios at birth in Youngnam region according to age of mother, parity, educational attainment of parents, and occupation of parents. A series of simulations are also conducted in this paper to show how prenatal sex screening and son-selective abortion have affected the level of sex ratio at birth for years 1994, 2005 and 2009.

Improvement Issues of Personal Information Protection Laws through Meta-Analysis (메타분석을 통한 개인정보보호법의 개선과제)

  • Cho, Myunggeun;Lee, Hwansoo
    • Journal of Digital Convergence / v.15 no.9 / pp.1-14 / 2017
  • As we enter the era of big data, the value of personal information is becoming ever more important. However, personal information protection laws in Korea have several issues. Furthermore, existing research is limited in its ability to facilitate a comprehensive understanding of measures to improve personal information protection laws. Accordingly, this study analyzes improvements to be made to the current personal information protection laws based on existing research. A total of 39 research articles discussing the problems of the personal information protection law were selected and analyzed by applying the meta-analysis technique. The results confirmed various issues in the current personal information protection laws, such as the meaning and scope of personal information, the role and obligations of relevant parties, the provision of personal information to third parties, and redundant and imbalanced regulations in the special acts of each field. In practice, this study contributes to resolving inconsistencies between information protection laws and related special laws in each field. Academically, it will contribute to understanding the problems of the law from a macro perspective and to suggesting integrated ways of improving the law.

Regulation of Matrix Metalloproteinase-1 Expression by the Homeodomain Transcription Factor Caudal in Drosophila Intestine (초파리 장조직에서 Caudal 전사조절인자에 의한 matrix metalloproteinase-1 발현 조절)

  • Lee, Shin-Hae;Hwang, Mi-Sun;Choi, Yoon-Jeong;Kim, Young-Shin;Yoo, Mi-Ae
    • Journal of Life Science / v.22 no.12 / pp.1600-1607 / 2012
  • The matrix metalloproteinase (MMP) family plays essential roles in physiological processes such as embryonic development, angiogenesis, wound healing, and tissue homeostasis as a consequence of MMPs' capacity for breaking down many types of extracellular matrix proteins. Imbalanced regulation of MMP expression can also lead to pathological conditions such as tumor progression. We recently reported that the Drosophila Mmp1 gene is highly expressed in the digestive tract and is required for the maintenance of intestinal homeostasis, for example by restricting uncontrolled intestinal stem cell proliferation. However, the regulatory mechanisms of MMP gene expression in the intestine remain unclear. In this study, we determined that the expression of Mmp1 is regulated by the homeodomain transcription factor Caudal. Experiments using the targeted expression of Caudal under the regulation of the Gal4-UAS system indicated that endogenous Caudal is required for Mmp1 gene expression in the adult Drosophila intestine and that exogenous Caudal induces Mmp1 expression. Transient transfection experiments indicated that Caudal can activate the promoter activity of Mmp1 and that several putative Caudal binding sites in the 5'-flanking region of the Mmp1 gene may be critical to the upregulation by Caudal. Our data suggest that Mmp1 is one of the target genes of Caudal under normal physiological conditions and in tumorigenesis.

A Regional Source-Receptor Analysis for Air Pollutants in Seoul Metropolitan Area (수도권지역에서의 권역간 대기오염물질 상호영향 연구)

  • Lee, Yong-Mi;Hong, Sung-Chul;Yoo, Chul;Kim, Jeong-Soo;Hong, Ji-Hyung;Park, Il-Su
    • Journal of Environmental Science International / v.19 no.5 / pp.591-605 / 2010
  • This study simulated major criteria air pollutants and estimated regional source-receptor relationships in the Seoul Metropolitan area using an air quality prediction model (TAPM; The Air Pollution Model). The source-receptor relationship was estimated as the contribution of each region to the other regions and to itself, after dividing the Seoul metropolitan area into five regions. Following administrative boundaries, regions I and II were Seoul and Incheon, respectively. Gyeonggi was divided by direction into three regions: southern (region III), northern (region IV), and eastern (region V). Gridded emissions ($1\,\mathrm{km} \times 1\,\mathrm{km}$) from the Clean Air Policy Support System (CAPSS) of the National Institute of Environmental Research (NIER) were prepared for the TAPM simulation. The operational weather prediction system, the Regional Data Assimilation and Prediction System (RDAPS) operated by the Korea Meteorological Administration (KMA), was used for regional weather forecasting at 30 km grid resolution. The modeling period was 5 consecutive days without precipitation in each season. The results showed that region I was the most polluted area, 3 to 4 times more polluted than the other regions for $NO_2$, $SO_2$, and PM10. For regions I, II, and III, more than 50 percent of the $SO_2$, $NO_2$, and PM10 contributions came from their own sources. However, regions IV and V were mostly affected by sources in regions I, II, and III. When emissions of all regions were assumed to be reduced by 10 and 20 percent in separate scenarios, the air pollution of each region decreased linearly, and the contributions in the reduction scenarios were similar to those of the base case. When input emissions were reduced at different rates (region I by 40 percent, regions II and III by 20 percent, regions IV and V by 10 percent), the air pollution of regions I and III decreased remarkably. The contributions of their own sources to regions I, II, and III were also reduced. However, regions I, II, and III then affected regions IV and V more.
In short, graded emission reduction could be more effective for controlling air pollution in areas with imbalanced emissions.

Bike Insurance Fraud Detection Model Using Balanced Randomforest Algorithm (균형 랜덤 포레스트를 이용한 이륜차 보험사기 적발 모형 개발)

  • Kim, Seunghoon;Lee, Soo Il;Kim, Tae ho
    • Journal of Digital Convergence / v.20 no.2 / pp.241-250 / 2022
  • Due to the COVID-19 pandemic, with increased 'untact' (contactless) services and an unstable household economy, bike insurance fraud is expected to surge. Moreover, fraud methods are becoming more complicated. However, there is no fraud detection model for bike insurance. We deal with the issue of skewed class distribution and reflect the criteria of fraud detection experts. We utilize a balanced random-forest algorithm to develop an efficient bike insurance fraud detection model. As a result, the predictive performance of the balanced random-forest model is superior to that of the non-balanced model, while there is no significant difference between the variables used by the experts and those of the confirmatory models. The important variables for detecting fraud turned out to be the age and gender of the driver, the correspondence between the insured and the driver, the amount of the self-repairing claim, and the amount of bodily injury liability.
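
The abstract does not spell out how the balanced random forest draws its training data. A commonly described scheme, assumed here, gives each tree a bootstrap sample of the minority class plus an equally sized sample, drawn with replacement, from every other class. The sketch below shows only that sampling step; the function name, class labels, and counts are hypothetical.

```python
import random
from collections import Counter

def balanced_bootstrap(labels, rng):
    """Return row indices for one tree: the same number of draws (with
    replacement) from every class, sized to the smallest class."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    n_min = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        sample.extend(rng.choice(indices) for _ in range(n_min))
    return sample

# Hypothetical claims data: 5 fraudulent claims among 100.
labels = ["fraud"] * 5 + ["normal"] * 95
rng = random.Random(42)

# One balanced sample per tree; each tree sees a different majority subset.
samples = [balanced_bootstrap(labels, rng) for _ in range(3)]
for sample in samples:
    print(Counter(labels[i] for i in sample))
```

Because every tree sees a different slice of the majority class, the forest as a whole still uses most of the normal claims while each individual tree trains on balanced data.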