• Title/Summary/Keyword: lasso


On sampling algorithms for imbalanced binary data: performance comparison and some caveats (불균형적인 이항 자료 분석을 위한 샘플링 알고리즘들: 성능비교 및 주의점)

  • Kim, HanYong;Lee, Woojoo
    • The Korean Journal of Applied Statistics / v.30 no.5 / pp.681-690 / 2017
  • Imbalanced binary classification problems arise in many settings, such as fraud detection in banking operations, spam mail detection, and defective product prediction. Several sampling methods, such as over-sampling, under-sampling, and SMOTE, have been developed to overcome the poor prediction performance of binary classifiers when one group is dominant. In this study, we investigate the prediction performance of logistic regression, Lasso, random forest, boosting, and support vector machine in combination with these sampling methods for imbalanced binary data. Four real data sets are analyzed to see whether there is a substantial improvement in prediction performance. We also emphasize some precautions when the sampling methods are implemented.
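As a rough illustration of the simplest method named above, random over-sampling can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name `random_oversample` and the toy data are hypothetical, and SMOTE (which synthesizes new points rather than duplicating existing ones) is not shown.

```python
import random


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)

    # Group row indices by class label.
    idx_by_class = {}
    for i, label in enumerate(y):
        idx_by_class.setdefault(label, []).append(i)

    target = max(len(ix) for ix in idx_by_class.values())

    chosen = []
    for ix in idx_by_class.values():
        chosen.extend(ix)                                   # keep every original row
        chosen.extend(rng.choices(ix, k=target - len(ix)))  # sample extras with replacement
    rng.shuffle(chosen)

    return [X[i] for i in chosen], [y[i] for i in chosen]


# Hypothetical toy training set: five majority rows, one minority row.
X_train = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y_train = [0, 0, 0, 0, 0, 1]

X_bal, y_bal = random_oversample(X_train, y_train)
print(y_bal.count(0), y_bal.count(1))
```

A commonly cited caution, which may or may not be among those the authors discuss, is to resample only the training split, so that duplicated rows cannot leak into the data used for evaluation.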

Weighted L1-Norm Support Vector Machine for the Classification of Highly Imbalanced Data (불균형 자료의 분류분석을 위한 가중 L1-norm SVM)

  • Kim, Eunkyung;Jhun, Myoungshic;Bang, Sungwan
    • The Korean Journal of Applied Statistics / v.28 no.1 / pp.9-21 / 2015
  • The support vector machine (SVM) has been successfully applied to various classification areas due to its flexibility and high level of classification accuracy. However, when analyzing imbalanced data with uneven class sizes, the classification accuracy of the SVM may drop significantly in predicting the minority class because SVM classifiers are undesirably biased toward the majority class. The weighted $L_2$-norm SVM was developed for the analysis of imbalanced data; however, it cannot identify irrelevant input variables due to the characteristics of the ridge penalty. Therefore, we propose the weighted $L_1$-norm SVM, which uses the lasso penalty to select important input variables and weights to differentiate the misclassification of data points between classes. We demonstrate the satisfactory performance of the proposed method through simulation studies and a real data analysis.
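One common way to write a weighted $L_1$-norm SVM objective is sketched below. The notation is assumed for illustration and is not taken from the paper, whose exact formulation may differ: $y_i \in \{-1, +1\}$ is the class label, $w_i > 0$ is a class-dependent weight (typically larger for the minority class), $[u]_+ = \max(u, 0)$ is the hinge function, and $\lambda \ge 0$ is the tuning parameter.

```latex
\min_{\beta_0,\,\beta}\;
  \sum_{i=1}^{n} w_i \bigl[\, 1 - y_i(\beta_0 + x_i^{\top}\beta) \,\bigr]_{+}
  \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

Replacing the last term with $\lambda \sum_{j} \beta_j^2$ gives the ridge-penalized ($L_2$-norm) counterpart, which shrinks coefficients but does not set them exactly to zero.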

Permutation test for a post selection inference of the FLSA (순열검정을 이용한 FLSA의 사후추론)

  • Choi, Jieun;Son, Won
    • The Korean Journal of Applied Statistics / v.34 no.6 / pp.863-874 / 2021
  • In this paper, we propose a post-selection inference procedure for the fused lasso signal approximator (FLSA). The FLSA finds an underlying sparse piecewise constant mean structure by applying the total variation (TV) semi-norm as a penalty term. However, it is widely known that this convex relaxation can cause asymptotic inconsistency in change point detection. As a result, false change points can remain even when we try to find the best subset of change points via a tuning procedure. To remove these false change points, we propose a post-selection inference for the FLSA. The proposed procedure applies a permutation test based on the CUSUM statistic. Our post-selection inference procedure extends the permutation test of Antoch and Hušková (2001), which deals with single change point problems, to multiple change point detection problems in combination with the FLSA. Numerical study results show that the proposed procedure performs better than naïve z-tests and tests based on the limiting distribution of CUSUM statistics.
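To make the idea of a CUSUM-based permutation test concrete, here is a minimal sketch for the single change point case only. It is not the authors' multiple change point procedure and does not involve the FLSA; the function names, the number of permutations, and the simulated data are all assumptions made for illustration.

```python
import random


def max_cusum(y):
    """Maximum absolute CUSUM of the centered series, scaled by sqrt(n).
    No variance scaling is applied: permuting the data leaves the sample
    variance unchanged, so it would not affect the permutation p-value."""
    n = len(y)
    mean = sum(y) / n
    partial, best = 0.0, 0.0
    for k in range(n - 1):          # candidate split after position k
        partial += y[k] - mean
        best = max(best, abs(partial))
    return best / n ** 0.5


def permutation_pvalue(y, n_perm=999, seed=0):
    """Permutation p-value for H0: no change in mean (observations exchangeable)."""
    rng = random.Random(seed)
    observed = max_cusum(y)
    perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if max_cusum(perm) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


# Hypothetical data: the mean shifts from 0 to 3 halfway through the series.
data_rng = random.Random(1)
shifted = ([data_rng.gauss(0, 1) for _ in range(40)]
           + [data_rng.gauss(3, 1) for _ in range(40)])
p_value = permutation_pvalue(shifted)
print(p_value)
```

A small p-value suggests that the largest CUSUM deviation is unlikely under exchangeability, which is the kind of evidence used to decide whether a candidate change point is genuine.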

Consumer behavior prediction using Airbnb web log data (에어비앤비(Airbnb) 웹 로그 데이터를 이용한 고객 행동 예측)

  • An, Hyoin;Choi, Yuri;Oh, Raeeun;Song, Jongwoo
    • The Korean Journal of Applied Statistics / v.32 no.3 / pp.391-404 / 2019
  • Customers' fixed characteristics have often been used to predict customer behavior. As customer activities move from offline to online, it has recently become possible to track customer web logs and to collect large amounts of web log data; however, researchers have focused only on organizing the log data or describing its technical characteristics. In this study, we predict the decision-making time until each customer makes the first reservation, using Airbnb customer data provided by the Kaggle website. This data set includes web logs as well as basic customer information such as gender and age. We use various methodologies to find the optimal model and compare prediction errors for cases with and without web log data. We consider six models, including Lasso, SVM, Random Forest, and XGBoost, to explore the effectiveness of the web log data. As a result, we choose Random Forest as our optimal model, with a misclassification rate of about 20%. In addition, we confirm that using web log data in our study doubles the prediction accuracy in predicting customer behavior compared to not using it.

Mean-shortfall optimization problem with perturbation methods (퍼터베이션 방법을 활용한 평균-숏폴 포트폴리오 최적화)

  • Won, Hayeon;Park, Seyoung
    • The Korean Journal of Applied Statistics / v.34 no.1 / pp.39-56 / 2021
  • Much research has been done on portfolio optimization since Markowitz (1952) published a diversified investment model. Markowitz's mean-variance portfolio optimization problem is established under the assumption that returns follow a normal distribution. However, in real life, returns do not follow a normal distribution, and variance is not a robust statistic because it is heavily influenced by outliers. To overcome these potential issues, the mean-shortfall portfolio model was proposed, which uses a downside risk measure, shortfall, as its risk index. In this paper, we propose a perturbation method that uses the shortfall as the risk index of the portfolio. The proposed portfolio utilizes an adaptive Lasso to obtain a sparse and stable asset selection, which can reduce management and transaction costs. The proposed optimization is easily applicable because it can be computed using efficient linear programming. In our real data analysis, we show the validity of the proposed perturbation method.

Feature selection and prediction modeling of drug responsiveness in Pharmacogenomics (약물유전체학에서 약물반응 예측모형과 변수선택 방법)

  • Kim, Kyuhwan;Kim, Wonkuk
    • The Korean Journal of Applied Statistics / v.34 no.2 / pp.153-166 / 2021
  • A main goal of pharmacogenomics studies is to predict an individual's drug responsiveness based on high-dimensional genetic variables. Because the number of variables is large, feature selection is required to reduce it. The selected features are then used to construct a predictive model using machine learning algorithms. In the present study, we applied several hybrid feature selection methods, such as combinations of logistic regression, ReliefF, TurF, random forest, and LASSO, to a next generation sequencing data set of 400 epilepsy patients. We then applied the selected features to machine learning methods including random forest, gradient boosting, and support vector machine, as well as a stacking ensemble method. Our results showed that the stacking model with a hybrid feature selection of random forest and ReliefF performs better than other combinations of approaches. Based on a 5-fold cross validation partition, the mean test accuracy of the best model was 0.727 and its mean test AUC was 0.761. It also appeared that the stacking models outperform single machine learning predictive models when using the same selected features.

A Modeling of Realtime Fuel Consumption Prediction Using OBDII Data (OBDII 데이터 기반의 실시간 연료 소비량 예측 모델 연구)

  • Yang, Hee-Eun;Kim, Do-Hyun;Choe, Hoseop
    • KIPS Transactions on Software and Data Engineering / v.10 no.2 / pp.57-64 / 2021
  • This study presents a method for realtime fuel consumption prediction using real data collected from OBDII. With the advent of the era of self-driving cars, electronic control units (ECUs) are becoming more complex, and various studies are attempting to extract and analyze more accurate data from vehicles. As ECUs become more complex, however, it is becoming harder to obtain data from them. To solve this problem, firmware for acquiring accurate vehicle data was developed in this study, and it was used to extract 53,580 actual driving data sets from vehicles from January to February 2019. Using these data, the ensemble stacking technique was applied to increase the accuracy of the realtime fuel consumption prediction model. Ridge, Lasso, XGBoost, and LightGBM were used as base models and Ridge was used as the meta model; the prediction performance was MAE 0.011 and RMSE 0.017.

An empirical evidence of inconsistency of the ℓ1 trend filtering in change point detection (ℓ1 추세필터의 변화점 식별에 있어서의 비일치성)

  • Yu, Donghyeon;Lim, Johan;Son, Won
    • The Korean Journal of Applied Statistics / v.35 no.3 / pp.371-384 / 2022
  • The fused LASSO signal approximator (FLSA) can be applied to find change points in data having a piecewise constant mean structure. It is well known that the FLSA is inconsistent in change point detection. This inconsistency is due to the total-variation denoising penalty of the FLSA. The ℓ1 trend filter, one of the popular tools for finding an underlying trend from data, can be used to identify change points of piecewise linear trends. Since the ℓ1 trend filter penalizes the sum of absolute values of slope differences, it can be inconsistent for change point recovery, like the FLSA. However, there are few studies on the inconsistency of the ℓ1 trend filtering. In this paper, we demonstrate the inconsistency of the ℓ1 trend filtering with a numerical study.
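For orientation, the two penalties contrasted above are commonly written as follows. The notation is assumed rather than taken from the paper: $y_1, \dots, y_n$ are the observations, $\theta_1, \dots, \theta_n$ the fitted values, and $\lambda \ge 0$ the tuning parameter. The first line shows only the total-variation part of the FLSA; the FLSA may also carry a separate sparsity penalty on $\lvert \theta_i \rvert$, omitted here.

```latex
% Total-variation (fused) penalty: penalizes jumps in level
\min_{\theta}\; \frac{1}{2}\sum_{i=1}^{n} (y_i - \theta_i)^2
  + \lambda \sum_{i=2}^{n} \lvert \theta_i - \theta_{i-1} \rvert

% l1 trend filter: penalizes changes in slope
\min_{\theta}\; \frac{1}{2}\sum_{i=1}^{n} (y_i - \theta_i)^2
  + \lambda \sum_{i=2}^{n-1} \lvert \theta_{i-1} - 2\theta_i + \theta_{i+1} \rvert
```

The first penalty favors piecewise constant fits and the second piecewise linear fits, which is why the two methods target change points in the mean and in the slope, respectively.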

Radiomics-based Biomarker Validation Study for Region Classification in 2D Prostate Cross-sectional Images (2D 전립선 단면 영상에서 영역 분류를 위한 라디오믹스 기반 바이오마커 검증 연구)

  • Jun Young, Park;Young Jae, Kim;Jisup, Kim;Kwang Gi, Kim
    • Journal of Biomedical Engineering Research / v.44 no.1 / pp.25-32 / 2023
  • Recognizing the size and location of prostate cancer is critical for diagnosis, treatment, and prognosis prediction. This paper proposes a model to classify tumor regions and normal tissue in cross-sectional visual images of prostatectomy tissue. We used specimen images of 44 prostate cancer patients who underwent prostatectomy at Gachon University Gil Hospital. A total of 289 prostate slice images were used, consisting of 200 slices with a tumor region and 89 slices without one. Images were divided based on the presence or absence of tumor, and a total of 93 features were extracted from each slice image using Radiomics: 18 first order, 24 GLCM, 16 GLRLM, 16 GLSZM, 5 NGTDM, and 14 GLDM. To obtain a high-performing model, we compared the feature selection techniques LASSO, ANOVA, SFS, and Ridge in combination with the classifiers RF, LR, and SVM. We evaluated model performance using the AUC of the ROC curve. The results showed that the combination of the feature selection techniques LASSO and Ridge with the RF classifier performed best, with an AUC of 0.99±0.005.

Comparative Study of Data Preprocessing and ML&DL Model Combination for Daily Dam Inflow Prediction (댐 일유입량 예측을 위한 데이터 전처리와 머신러닝&딥러닝 모델 조합의 비교연구)

  • Youngsik Jo;Kwansue Jung
    • Proceedings of the Korea Water Resources Association Conference / 2023.05a / pp.358-358 / 2023
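  • In this study, we used representative machine learning and deep learning (ML&DL) models that have been applied to rainfall-runoff analysis in the water resources field, and compared daily inflow prediction performance across scenarios that combine data characteristics with ML&DL models. The comparison considers not only hyperparameter tuning but also the combination and preprocessing (lag-time, moving average, etc.) of meteorological and hydrological data suited to each model's characteristics. For the Soyanggang Dam basin, using meteorological and hydrological data accumulated from 1974 to 2021, we considered 1) rainfall, 2) inflow, and 3) meteorological data as the main influencing (independent) variables and applied a) lag-time, b) moving average, and c) inflow component separation conditions, yielding a total of 36 scenario combinations used as ML&DL input data. We compared ten ML&DL models in total: seven ML methods, namely 1) Linear Regression (LR), 2) Lasso, 3) Ridge, 4) SVR (Support Vector Regression), 5) Random Forest (RF), 6) LGBM (Light Gradient Boosting Model), and 7) XGBoost, and three DL methods, namely 8) LSTM (Long Short-Term Memory models), 9) TCN (Temporal Convolutional Network), and 10) LSTM-TCN. We present the most suitable data combination characteristics and ML&DL model for daily inflow prediction, together with a performance evaluation. Comparing the inflow predictions of the trained models for the Soyanggang Dam basin, TCN performed best among the deep learning models (TCN>TCN-LSTM>LSTM), Random Forest and LGBM performed well among the tree-based machine learning models (RF, LGBM>XGB), and SVR performed at a level comparable to LGBM. The three regression models, LR, Lasso, and Ridge, showed relatively low performance. In addition, across the 36 combinations of rainfall, inflow, and meteorological series for predicting inflow to Soyanggang Dam, the combinations using the rainfall series with lag-time applied gave an NSE (Nash-Sutcliffe Efficiency) of 0.8 or higher (maximum 0.867) for all models except the three regression models, and combining the lag-time-applied rainfall and inflow series gave even better performance, with an NSE of 0.85 or higher (maximum 0.901).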
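The NSE (Nash-Sutcliffe Efficiency) reported above is one minus the ratio of the squared prediction error to the variance of the observations around their mean. A minimal sketch of the calculation follows; the function name `nash_sutcliffe` and the inflow values are hypothetical and are not data from the study.

```python
def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe Efficiency: 1 - SSE / (sum of squares around the observed mean).
    1.0 is a perfect fit; 0.0 means no better than predicting the observed mean."""
    if len(observed) != len(simulated):
        raise ValueError("observed and simulated must have the same length")
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


# Hypothetical daily inflow values and model predictions.
observed_inflow = [10.0, 20.0, 30.0, 40.0]
predicted_inflow = [12.0, 18.0, 33.0, 37.0]

nse = nash_sutcliffe(observed_inflow, predicted_inflow)
print(round(nse, 3))  # 0.948
```

Here the squared error is 26 and the observed sum of squares is 500, so the NSE is 1 - 26/500 = 0.948.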
