• Title/Abstract/Keywords: Least absolute shrinkage and selection operator

Search results: 49 items (processing time: 0.022 s)

Model selection for unstable AR process via the adaptive LASSO

  • 나옥경
    • 응용통계연구 (The Korean Journal of Applied Statistics) / Vol. 32, No. 6 / pp. 909-922 / 2019
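  • Among penalized estimation methods, the adaptive LASSO is a well-known method that performs model selection and parameter estimation simultaneously, and it has already been studied for stationary autoregressive models. This paper extends that work and uses simulation to study the properties of the adaptive LASSO estimator in nonstationary autoregressive models such as the random walk process. Because deciding whether a unit root is present and selecting the model order are the most important tasks in nonstationary autoregressive models, the adaptive LASSO is applied not to the original autoregressive model but to its transformation into the regression model considered in the ADF test. In general, choosing the tuning parameter is the most important issue when applying the adaptive LASSO, and this paper selects it using three methods: cross-validation, AIC, and BIC. The simulation results show that the adaptive LASSO estimator corresponding to the tuning parameter that minimizes BIC not only judges the presence of a unit root well but also selects the order of the autoregressive model relatively accurately.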
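
The setup described in this abstract can be written out as follows. This is only a sketch under assumptions not stated in the abstract: no deterministic terms (intercept or trend) in the ADF regression, weights built from an initial estimate (marked with a tilde) with exponent \gamma > 0, and a common form of BIC in which df_\lambda counts the nonzero coefficients. The paper's exact formulation may differ.

```latex
% AR(p) model and its ADF-type reparameterization (a unit root corresponds to \rho = 0)
y_t = \sum_{i=1}^{p} a_i\, y_{t-i} + \varepsilon_t
\quad\Longleftrightarrow\quad
\Delta y_t = \rho\, y_{t-1} + \sum_{j=1}^{p-1} \phi_j\, \Delta y_{t-j} + \varepsilon_t,
\qquad \rho = \sum_{i=1}^{p} a_i - 1, \quad \phi_j = -\sum_{i=j+1}^{p} a_i .

% Adaptive LASSO applied to the transformed regression
(\hat\rho, \hat\phi) = \arg\min_{\rho,\,\phi}
\sum_{t} \Bigl( \Delta y_t - \rho\, y_{t-1} - \sum_{j=1}^{p-1} \phi_j\, \Delta y_{t-j} \Bigr)^{2}
+ \lambda \Bigl( w_0 |\rho| + \sum_{j=1}^{p-1} w_j |\phi_j| \Bigr),
\qquad w_0 = \frac{1}{|\tilde\rho|^{\gamma}}, \quad w_j = \frac{1}{|\tilde\phi_j|^{\gamma}} .

% One common BIC criterion for choosing the tuning parameter \lambda
\mathrm{BIC}(\lambda) = T \log \hat\sigma_\lambda^{2} + \mathrm{df}_\lambda \log T .
```

In this form, an estimate \hat\rho = 0 corresponds to concluding that a unit root is present, and the largest j with \hat\phi_j \neq 0 determines the selected order.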

Penalized variable selection in mean-variance accelerated failure time models

  • 권지훈;하일도
    • 응용통계연구 (The Korean Journal of Applied Statistics) / Vol. 34, No. 3 / pp. 411-425 / 2021
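  • The accelerated failure time (AFT) model describes a linear relationship between the log survival time and covariates. In the AFT model, it is of interest to infer covariate effects on the variability of the survival time as well as on its mean. This requires modeling the variance of the survival time in addition to the mean, and such a model is called the mean-variance AFT model. This paper proposes a variable selection procedure for the regression parameters in the mean-variance AFT model using a penalized likelihood function. Four penalty functions are studied: LASSO, ALASSO, SCAD, and HL (hierarchical likelihood). The proposed procedure selects important covariates and estimates the regression parameters simultaneously. Its performance is evaluated through simulation studies, and the method is illustrated with a clinical example data set.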

A study of Predicting International Gasoline Prices based on Multiple Linear Regression with Economic Indicators

  • 한명은;김지연;이현희;김세인;박민서
    • 문화기술의 융합 (The Journal of the Convergence on Culture Technology) / Vol. 10, No. 1 / pp. 159-164 / 2024
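  • Because the domestic petroleum market is highly sensitive to fluctuations in international oil prices, understanding and responding to that volatility is important. In particular, it is necessary to identify clearly which factors drive changes in the price of gasoline, which has high consumption. International gasoline prices are affected by global factors such as gasoline supply and demand, geopolitical events, and fluctuations in the value of the US dollar. However, previous studies are limited in that they focused only on gasoline supply and demand. This study uses various machine learning-based regression models to explore the causal relationship between macroeconomic indicators and international gasoline prices. First, data on various global economic indicators are collected. Second, the data are preprocessed. Third, modeling is performed with multiple linear regression, Ridge regression, and Lasso (Least Absolute Shrinkage and Selection Operator) regression models. In the experiments, the multiple linear regression model showed the highest accuracy (97.3%) on the test data set. We expect that predicting international gasoline prices can support domestic economic stability and energy policy decisions.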

Risk Prediction Using Genome-Wide Association Studies on Type 2 Diabetes

  • Choi, Sungkyoung;Bae, Sunghwan;Park, Taesung
    • Genomics & Informatics / Vol. 14, No. 4 / pp. 138-148 / 2016
  • The success of genome-wide association studies (GWASs) has enabled us to improve risk assessment and provide novel genetic variants for diagnosis, prevention, and treatment. However, most variants discovered by GWASs have been reported to have very small effect sizes on complex human diseases, which has been a big hurdle in building risk prediction models. Recently, many statistical approaches based on penalized regression have been developed to solve the "large p and small n" problem. In this report, we evaluated the performance of several statistical methods for predicting a binary trait: stepwise logistic regression (SLR), least absolute shrinkage and selection operator (LASSO), and Elastic-Net (EN). We first built a prediction model by combining variable selection and prediction methods for type 2 diabetes using Affymetrix Genome-Wide Human SNP Array 5.0 from the Korean Association Resource project. We assessed the risk prediction performance using area under the receiver operating characteristic curve (AUC) for the internal and external validation datasets. In the internal validation, SLR-LASSO and SLR-EN tended to yield more accurate predictions than other combinations. During the external validation, the SLR-SLR and SLR-EN combinations achieved the highest AUC of 0.726. We propose these combinations as a potentially powerful risk prediction model for type 2 diabetes.
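
The AUC used above as the performance measure can be illustrated with a minimal sketch. It computes AUC as the fraction of case-control pairs in which the case receives the higher predicted risk, with ties counted as one half. The function name `auc` and the example scores are hypothetical and are not taken from the study.

```python
def auc(scores, labels):
    """AUC as the share of (case, control) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # predicted risks of cases
    neg = [s for s, y in zip(scores, labels) if y == 0]  # predicted risks of controls
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical predicted risks for three cases (label 1) and three controls (label 0).
example_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
example_labels = [1, 1, 1, 0, 0, 0]
example_auc = auc(example_scores, example_labels)  # 8 of 9 pairs are ranked correctly
```

The pairwise loop is quadratic in the sample size, so for data sets of realistic size a rank-based computation would be preferable.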

Effect of outliers on the variable selection by the regularized regression

  • Jeong, Junho;Kim, Choongrak
    • Communications for Statistical Applications and Methods / Vol. 25, No. 2 / pp. 235-243 / 2018
  • Many studies exist on the influence of one or a few observations on estimators in a variety of statistical models under the "large n, small p" setup; however, diagnostic issues in regression models have rarely been studied in a high-dimensional setup. In high-dimensional data, the influence of observations is more serious because the sample size n is significantly smaller than the number of variables p. Here, we investigate the influence of observations on the least absolute shrinkage and selection operator (LASSO) estimates, suggested by Tibshirani (Journal of the Royal Statistical Society, Series B, 58, 267-288, 1996), and the influence of observations on the variables selected by the LASSO in the high-dimensional setup. We also derive an analytic expression for the influence of the k-th observation on LASSO estimates in simple linear regression. Numerical studies based on artificial data and real data are presented for illustration. The numerical results show that the influence of observations on the LASSO estimates and on the variables selected by the LASSO is more severe in the high-dimensional setup than in the usual "large n, small p" setup.
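
The idea of measuring the influence of a single observation on a LASSO estimate can be sketched by brute force. This is not the analytic expression derived in the paper. It assumes a simple linear regression without an intercept and the objective (1/(2n)) * sum((y_i - b*x_i)^2) + lam*|b|, whose minimizer is a soft-thresholded least-squares slope. All function names and the data are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; return exactly 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_slope(x, y, lam):
    """Closed-form LASSO slope for the no-intercept model under the stated objective."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) / n
    sxx = sum(a * a for a in x) / n
    return soft_threshold(sxy, lam) / sxx


def deletion_influence(x, y, lam):
    """Change in the LASSO slope when observation k is removed, for each k."""
    full = lasso_slope(x, y, lam)
    return [
        full - lasso_slope(x[:k] + x[k + 1:], y[:k] + y[k + 1:], lam)
        for k in range(len(x))
    ]


# Hypothetical data: the last response is an outlier.
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [1.1, 1.9, 3.2, 3.9, 15.0]
influence = deletion_influence(x_demo, y_demo, lam=0.5)
most_influential = max(range(len(influence)), key=lambda k: abs(influence[k]))
```

Refitting after each deletion is cheap here only because the one-variable problem has a closed form; in the high-dimensional setting of the abstract each refit would require a full LASSO solve.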

Tracing the breeding farm of domesticated pig using feature selection (Sus scrofa)

  • Kwon, Taehyung;Yoon, Joon;Heo, Jaeyoung;Lee, Wonseok;Kim, Heebal
    • Asian-Australasian Journal of Animal Sciences / Vol. 30, No. 11 / pp. 1540-1549 / 2017
  • Objective: Increasing food safety demands in the animal product market have created a need for a system to trace the food distribution process, from the manufacturer to the retailer, and genetic traceability is an effective method to trace the origin of animal products. In this study, we successfully achieved the farm tracing of 6,018 multi-breed pigs, using single nucleotide polymorphism (SNP) markers strictly selected through least absolute shrinkage and selection operator (LASSO) feature selection. Methods: We performed farm tracing of domesticated pig (Sus scrofa) from SNP markers and selected the most relevant features for accurate prediction. Considering the multi-breed composition of our data, we performed feature selection using LASSO penalization on 4,002 SNPs that are shared between breeds, which also include 179 SNPs with small between-breed differences. The 100 highest-scored features were extracted from iterative simulations and then evaluated using machine-learning based classifiers. Results: We selected 1,341 SNPs from over 45,000 SNPs through iterative LASSO feature selection, to minimize between-breed differences. We subsequently selected the 100 highest-scored SNPs from iterative scoring, and observed high statistical measures in classification of breeding farms by cross-validation using only these SNPs. Conclusion: The study represents a successful application of LASSO feature selection on multi-breed pig SNP data to trace farm information, which provides a valuable method and a basis for further research on genetic traceability.

Intelligent System for the Prediction of Heart Diseases Using Machine Learning Algorithms with a New Mixed Feature Creation (MFC) technique

  • Rawia Elarabi;Abdelrahman Elsharif Karrar;Murtada El-mukashfi El-taher
    • International Journal of Computer Science & Network Security / Vol. 23, No. 5 / pp. 148-162 / 2023
  • Classification systems can significantly assist the medical sector by allowing for the precise and quick diagnosis of diseases. As a result, both doctors and patients will save time. One possible way to identify risk variables is to use machine learning algorithms. Non-surgical technologies, such as machine learning, are trustworthy and effective in categorizing healthy and heart-disease patients, and they save time and effort. The goal of this study is to create a medical intelligent decision support system based on machine learning for the diagnosis of heart disease. We have used a mixed feature creation (MFC) technique to generate new features from the UCI Cleveland Cardiology dataset. We select the most suitable features by using Least Absolute Shrinkage and Selection Operator (LASSO), Recursive Feature Elimination with Random Forest feature selection (RFE-RF), and the best features of both LASSO and RFE-RF (BLR) techniques. Cross-validation and grid-search methods are used to optimize the parameters of the estimator used in applying these algorithms. Classifier performance metrics, including classification accuracy, specificity, sensitivity, precision, and F1-score for each classification model, along with execution time and RMSE, are presented independently for comparison. Our proposed work finds the best potential outcome across all available prediction models and improves the system's performance, allowing physicians to diagnose heart patients more accurately.

Improvement of inspection system for common crossings by track side monitoring and prognostics

  • Sysyn, Mykola;Nabochenko, Olga;Kovalchuk, Vitalii;Gruen, Dimitri;Pentsak, Andriy
    • Structural Monitoring and Maintenance / Vol. 6, No. 3 / pp. 219-235 / 2019
  • Scheduled inspections of common crossings are one of the main cost drivers of railway maintenance. The prognostics and health management (PHM) approach and modern monitoring means offer many possibilities for optimizing inspections and maintenance. The present paper deals with data-driven prognosis of the common crossing remaining useful life (RUL) based on an inertial monitoring system. The problem of the scheduled inspection system for common crossings is outlined and analysed. The proposed analysis of inertial signals with the maximal overlap discrete wavelet packet transform (MODWPT) and Shannon entropy (SE) estimates enables the extraction of spectral features. The relevant features for the acceleration components are selected by applying Lasso (Least absolute shrinkage and selection operator) regularization. The features are fused with time-domain information about the longitudinal position of wheel impact and train velocities by multivariate regression. The fused structural health (SH) indicator has a significant correlation with the lifetime of the crossing. The RUL prognosis is performed with a linear degradation stochastic model with recursive Bayesian update. Prognosis testing metrics show promising results for improving common crossing inspection scheduling.

Investigating Influential Factors on Health Status and Job Satisfaction Using Lasso Modeling

  • 권보성;엄성원;정기효
    • 대한안전경영과학회지 (Journal of the Korea Safety Management & Science) / Vol. 26, No. 3 / pp. 101-106 / 2024
  • The health and working conditions of employees have become increasingly important issues in modern society. In recent years, there has been a continuous rise in problems related to the deterioration of workers' health, which seriously affects their safety and overall quality of life. Although existing research has investigated various factors affecting workers' health and working conditions, there is still a lack of studies that scientifically analyze and identify key variables from the vast number of factors. This study employs the Lasso (Least Absolute Shrinkage and Selection Operator) technique to mathematically analyze the key variables influencing workers' health status and satisfaction with their working environment. Lasso is a technique used in machine learning to identify a small number of variables that impact the dependent variable among a large set of variables, thereby reducing model complexity and improving predictive accuracy. The results of the study can be used to improve workers' health and working environments efficiently by focusing on a smaller set of impactful variables.
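
How Lasso keeps only a small number of variables can be shown with a minimal coordinate-descent sketch. It assumes the objective (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 with no intercept and is not the implementation used in the study. The function names and the toy design, in which only the first of three predictors has a large effect, are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; return exactly 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/(2n))*||y - X b||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    resid = list(y)  # residual y - X beta, starting from beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes column j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Hypothetical toy design: three mutually orthogonal +/-1 predictors.
x1 = [1, 1, 1, 1, -1, -1, -1, -1]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
x3 = [1, -1, 1, -1, 1, -1, 1, -1]
X_demo = [[float(a), float(b), float(c)] for a, b, c in zip(x1, x2, x3)]
# Strong effect of x1, weak effect of x2, no effect of x3, plus a small disturbance.
y_demo = [2.0 * a + 0.1 * b + 0.05 * a * b * c for a, b, c in X_demo]

beta_hat = lasso_cd(X_demo, y_demo, lam=0.3)
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
```

With this penalty level only the first predictor survives: its coefficient is shrunk from 2.0 to about 1.7, while the weak and null predictors are set exactly to zero, which is the selection behavior described in the abstract.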

Exploration of Community Risk Factors for COVID-19 Incidence in Korea

  • 심보람;박명배
    • 보건행정학회지 (Health Policy and Management) / Vol. 32, No. 1 / pp. 45-52 / 2022
  • Background: There are regional variations in the incidence of coronavirus disease 2019 (COVID-19), which means that some regions are more exposed to the risk of COVID-19 than others. Therefore, this study aims to investigate regional variations in the incidence of COVID-19 in Korea and identify risk factors associated with the incidence of COVID-19 using community-level data. Methods: This study was conducted at the district (si·gun·gu) level in Korea. Data on COVID-19 incidence by district were collected from the official website of each province. Data on socio-demographic factors, transmission pathways, healthcare resources, and COVID-19 response factors were also obtained from the Korean Statistical Information Service and the Community Health Survey. Community risk factors that drive the incidence of COVID-19 were selected using least absolute shrinkage and selection operator regression. Results: As of June 2021, the incidence of COVID-19 differed more than 80-fold between districts. Among the candidate factors, sex ratio, population aged 20-29, local financial independence, population density, diabetes prevalence, and failure to comply with the quarantine rules were significantly associated with COVID-19 incidence. Conclusion: This study suggests setting COVID-19 quarantine policy and allocating resources with the community risk factors in mind. Protecting vulnerable groups should be a high priority for these policies.