• Title/Summary/Keyword: least absolute shrinkage and selection operator(LASSO)

Search Result 34, Processing Time 0.029 seconds

An Application of the Clustering Threshold Gradient Descent Regularization Method for Selecting Genes in Predicting the Survival Time of Lung Carcinomas

  • Lee, Seung-Yeoun;Kim, Young-Chul
    • Genomics & Informatics
    • /
    • v.5 no.3
    • /
    • pp.95-101
    • /
    • 2007
  • In this paper, we consider the variable selection methods in the Cox model when a large number of gene expression levels are involved with survival time. Deciding which genes are associated with survival time has been a challenging problem because of the large number of genes and relatively small sample size (n<

Model selection for unstable AR process via the adaptive LASSO (비정상 자기회귀모형에서의 벌점화 추정 기법에 대한 연구)

  • Na, Okyoung
    • The Korean Journal of Applied Statistics
    • /
    • v.32 no.6
    • /
    • pp.909-922
    • /
    • 2019
  • In this paper, we study the adaptive least absolute shrinkage and selection operator (LASSO) for the unstable autoregressive (AR) model. To identify the existence of the unit root, we apply the adaptive LASSO to the augmented Dickey-Fuller regression model, not the original AR model. We illustrate our method with simulations and a real data analysis. Simulation results show that the adaptive LASSO obtained by minimizing the Bayesian information criterion selects the order of the autoregressive model as well as the degree of differencing with high accuracy.

Drought forecasting over South Korea based on the teleconnected global climate variables

  • Taesam Lee;Yejin Kong;Sejeong Lee;Taegyun Kim
    • Proceedings of the Korea Water Resources Association Conference
    • /
    • 2023.05a
    • /
    • pp.47-47
    • /
    • 2023
  • Drought occurs due to lack of water resources over an extended period and its intensity has been magnified globally by climate change. In recent years, drought over South Korea has also been intensed, and the prediction was inevitable for the water resource management and water industry. Therefore, drought forecasting over South Korea was performed in the current study with the following procedure. First, accumulated spring precipitation(ASP) driven by the 93 weather stations in South Korea was taken with their median. Then, correlation analysis was followed between ASP and Df4m, the differences of two pair of the global winter MSLP. The 37 Df4m variables with high correlations over 0.55 was chosen and sorted into three regions. The selected Df4m variables in the same region showed high similarity, leading the multicollinearity problem. To avoid this problem, a model that performs variable selection and model fitting at once, least absolute shrinkage and selection operator(LASSO) was applied. The LASSO model selected 5 variables which showed a good agreement of the predicted with the observed value, R2=0.72. Other models such as multiple linear regression model and ElasticNet were also performed, but did not present a performance as good as LASSO. Therefore, LASSO model can be an appropriate model to forecast spring drought over South Korea and can be used to mange water resources efficiently.

  • PDF

Effect of outliers on the variable selection by the regularized regression

  • Jeong, Junho;Kim, Choongrak
    • Communications for Statistical Applications and Methods
    • /
    • v.25 no.2
    • /
    • pp.235-243
    • /
    • 2018
  • Many studies exist on the influence of one or few observations on estimators in a variety of statistical models under the "large n, small p" setup; however, diagnostic issues in the regression models have been rarely studied in a high dimensional setup. In the high dimensional data, the influence of observations is more serious because the sample size n is significantly less than the number variables p. Here, we investigate the influence of observations on the least absolute shrinkage and selection operator (LASSO) estimates, suggested by Tibshirani (Journal of the Royal Statistical Society, Series B, 73, 273-282, 1996), and the influence of observations on selected variables by the LASSO in the high dimensional setup. We also derived an analytic expression for the influence of the k observation on LASSO estimates in simple linear regression. Numerical studies based on artificial data and real data are done for illustration. Numerical results showed that the influence of observations on the LASSO estimates and the selected variables by the LASSO in the high dimensional setup is more severe than that in the usual "large n, small p" setup.

Genomic Selection for Adjacent Genetic Markers of Yorkshire Pigs Using Regularized Regression Approaches

  • Park, Minsu;Kim, Tae-Hun;Cho, Eun-Seok;Kim, Heebal;Oh, Hee-Seok
    • Asian-Australasian Journal of Animal Sciences
    • /
    • v.27 no.12
    • /
    • pp.1678-1683
    • /
    • 2014
  • This study considers a problem of genomic selection (GS) for adjacent genetic markers of Yorkshire pigs which are typically correlated. The GS has been widely used to efficiently estimate target variables such as molecular breeding values using markers across the entire genome. Recently, GS has been applied to animals as well as plants, especially to pigs. For efficient selection of variables with specific traits in pig breeding, it is required that any such variable selection retains some properties: i) it produces a simple model by identifying insignificant variables; ii) it improves the accuracy of the prediction of future data; and iii) it is feasible to handle high-dimensional data in which the number of variables is larger than the number of observations. In this paper, we applied several variable selection methods including least absolute shrinkage and selection operator (LASSO), fused LASSO and elastic net to data with 47K single nucleotide polymorphisms and litter size for 519 observed sows. Based on experiments, we observed that the fused LASSO outperforms other approaches.

Multiple Group Testing Procedures for Analysis of High-Dimensional Genomic Data

  • Ko, Hyoseok;Kim, Kipoong;Sun, Hokeun
    • Genomics & Informatics
    • /
    • v.14 no.4
    • /
    • pp.187-195
    • /
    • 2016
  • In genetic association studies with high-dimensional genomic data, multiple group testing procedures are often required in order to identify disease/trait-related genes or genetic regions, where multiple genetic sites or variants are located within the same gene or genetic region. However, statistical testing procedures based on an individual test suffer from multiple testing issues such as the control of family-wise error rate and dependent tests. Moreover, detecting only a few of genes associated with a phenotype outcome among tens of thousands of genes is of main interest in genetic association studies. In this reason regularization procedures, where a phenotype outcome regresses on all genomic markers and then regression coefficients are estimated based on a penalized likelihood, have been considered as a good alternative approach to analysis of high-dimensional genomic data. But, selection performance of regularization procedures has been rarely compared with that of statistical group testing procedures. In this article, we performed extensive simulation studies where commonly used group testing procedures such as principal component analysis, Hotelling's $T^2$ test, and permutation test are compared with group lasso (least absolute selection and shrinkage operator) in terms of true positive selection. Also, we applied all methods considered in simulation studies to identify genes associated with ovarian cancer from over 20,000 genetic sites generated from Illumina Infinium HumanMethylation27K Beadchip. We found a big discrepancy of selected genes between multiple group testing procedures and group lasso.

Penalized variable selection in mean-variance accelerated failure time models (평균-분산 가속화 실패시간 모형에서 벌점화 변수선택)

  • Kwon, Ji Hoon;Ha, Il Do
    • The Korean Journal of Applied Statistics
    • /
    • v.34 no.3
    • /
    • pp.411-425
    • /
    • 2021
  • Accelerated failure time (AFT) model represents a linear relationship between the log-survival time and covariates. We are interested in the inference of covariate's effect affecting the variation of survival times in the AFT model. Thus, we need to model the variance as well as the mean of survival times. We call the resulting model mean and variance AFT (MV-AFT) model. In this paper, we propose a variable selection procedure of regression parameters of mean and variance in MV-AFT model using penalized likelihood function. For the variable selection, we study four penalty functions, i.e. least absolute shrinkage and selection operator (LASSO), adaptive lasso (ALASSO), smoothly clipped absolute deviation (SCAD) and hierarchical likelihood (HL). With this procedure we can select important covariates and estimate the regression parameters at the same time. The performance of the proposed method is evaluated using simulation studies. The proposed method is illustrated with a clinical example dataset.

Tracing the breeding farm of domesticated pig using feature selection (Sus scrofa)

  • Kwon, Taehyung;Yoon, Joon;Heo, Jaeyoung;Lee, Wonseok;Kim, Heebal
    • Asian-Australasian Journal of Animal Sciences
    • /
    • v.30 no.11
    • /
    • pp.1540-1549
    • /
    • 2017
  • Objective: Increasing food safety demands in the animal product market have created a need for a system to trace the food distribution process, from the manufacturer to the retailer, and genetic traceability is an effective method to trace the origin of animal products. In this study, we successfully achieved the farm tracing of 6,018 multi-breed pigs, using single nucleotide polymorphism (SNP) markers strictly selected through least absolute shrinkage and selection operator (LASSO) feature selection. Methods: We performed farm tracing of domesticated pig (Sus scrofa) from SNP markers and selected the most relevant features for accurate prediction. Considering multi-breed composition of our data, we performed feature selection using LASSO penalization on 4,002 SNPs that are shared between breeds, which also includes 179 SNPs with small between-breed difference. The 100 highest-scored features were extracted from iterative simulations and then evaluated using machine-leaning based classifiers. Results: We selected 1,341 SNPs from over 45,000 SNPs through iterative LASSO feature selection, to minimize between-breed differences. We subsequently selected 100 highest-scored SNPs from iterative scoring, and observed high statistical measures in classification of breeding farms by cross-validation only using these SNPs. Conclusion: The study represents a successful application of LASSO feature selection on multi-breed pig SNP data to trace the farm information, which provides a valuable method and possibility for further researches on genetic traceability.

Prediction of Quantitative Traits Using Common Genetic Variants: Application to Body Mass Index

  • Bae, Sunghwan;Choi, Sungkyoung;Kim, Sung Min;Park, Taesung
    • Genomics & Informatics
    • /
    • v.14 no.4
    • /
    • pp.149-159
    • /
    • 2016
  • With the success of the genome-wide association studies (GWASs), many candidate loci for complex human diseases have been reported in the GWAS catalog. Recently, many disease prediction models based on penalized regression or statistical learning methods were proposed using candidate causal variants from significant single-nucleotide polymorphisms of GWASs. However, there have been only a few systematic studies comparing existing methods. In this study, we first constructed risk prediction models, such as stepwise linear regression (SLR), least absolute shrinkage and selection operator (LASSO), and Elastic-Net (EN), using a GWAS chip and GWAS catalog. We then compared the prediction accuracy by calculating the mean square error (MSE) value on data from the Korea Association Resource (KARE) with body mass index. Our results show that SLR provides a smaller MSE value than the other methods, while the numbers of selected variables in each model were similar.

Efficient Compression Algorithm with Limited Resource for Continuous Surveillance

  • Yin, Ling;Liu, Chuanren;Lu, Xinjiang;Chen, Jiafeng;Liu, Caixing
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.10 no.11
    • /
    • pp.5476-5496
    • /
    • 2016
  • Energy efficiency of resource-constrained wireless sensor networks is critical in applications such as real-time monitoring/surveillance. To improve the energy efficiency and reduce the energy consumption, the time series data can be compressed before transmission. However, most of the compression algorithms for time series data were developed only for single variate scenarios, while in practice there are often multiple sensor nodes in one application and the collected data is actually multivariate time series. In this paper, we propose to compress the time series data by the Lasso (least absolute shrinkage and selection operator) approximation. We show that, our approach can be naturally extended for compressing the multivariate time series data. Our extension is novel since it constructs an optimal projection of the original multivariates where the best energy efficiency can be realized. The two algorithms are named by ULasso (Univariate Lasso) and MLasso (Multivariate Lasso), for which we also provide practical guidance for parameter selection. Finally, empirically evaluation is implemented with several publicly available real-world data sets from different application domains. We quantify the algorithm performance by measuring the approximation error, compression ratio, and computation complexity. The results show that ULasso and MLasso are superior to or at least equivalent to compression performance of LTC and PLAMlis. Particularly, MLasso can significantly reduce the smooth multivariate time series data, without breaking the major trends and important changes of the sensor network system.