• Title/Abstract/Keywords: Variable selection

885 results found (processing time: 0.022 s)

Cutpoint Selection via Penalization in Credit Scoring

  • 진슬기;김광래;박창이
    • 응용통계연구 / Vol. 25, No. 2 / pp. 261-267 / 2012
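  • When building a credit scorecard, each characteristic variable is divided into several attributes, and an appropriate weight is assigned to each attribute. This process is called coarse classification. Dividing characteristic variables into attributes requires choosing the cutpoints that serve as the boundaries. This paper proposes a penalization-based cutpoint selection method. Through several simulation studies and an analysis of real credit data, the performance of the proposed method is compared with that of an existing cutpoint selection method, the spline classification machine (Koo et al., 2009).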
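To make the role of cutpoints concrete, the following is a minimal sketch of coarse classification once cutpoints have already been chosen. It does not implement the paper's penalization-based selection; the function names, the example scorecard, and all cutpoints and weights are hypothetical and serve only as an illustration.

```python
from bisect import bisect_right


def coarse_classify(value, cutpoints):
    """Return the index of the attribute (bin) that `value` falls into.

    `cutpoints` must be sorted in ascending order; k cutpoints define
    k + 1 attributes. A value equal to a cutpoint goes to the bin on
    its right (an assumed convention).
    """
    return bisect_right(cutpoints, value)


def score(applicant, scorecard):
    """Sum the attribute weights over all characteristic variables."""
    total = 0.0
    for name, (cutpoints, weights) in scorecard.items():
        total += weights[coarse_classify(applicant[name], cutpoints)]
    return total


# Hypothetical scorecard: cutpoints and weights are illustrative only.
scorecard = {
    "age": ([25, 40, 60], [-10.0, 0.0, 15.0, 5.0]),
    "income": ([2000, 5000], [-20.0, 0.0, 25.0]),
}
applicant = {"age": 40, "income": 1500}
print(score(applicant, scorecard))
```

Each characteristic variable carries one more weight than it has cutpoints, so selecting fewer cutpoints directly yields a coarser, simpler scorecard.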

Methodology for Determining Functional Forms in Developing Statistical Collision Models

  • 백종대;험머 조셉
    • 한국도로학회논문집 / Vol. 14, No. 5 / pp. 189-199 / 2012
  • PURPOSES: The purpose of this study is to propose a new methodology for developing statistical collision models and to show the validation results of the methodology. METHODS: A new modeling method that introduces variables into the model one by one in a multiplicative form is suggested. A method for choosing the explanatory variables to be introduced into the model is explained. A method for determining the functional form for each explanatory variable is introduced, as well as a parameter estimation procedure. A model selection method is also covered. Finally, the validation results are provided to demonstrate the efficacy of the final models developed using the method suggested in this study. RESULTS: According to the validation results for total and injury collisions, the predictive power of the models developed using the suggested method was better than that of generalized linear models for the same data. CONCLUSIONS: Using the methodology suggested in this study, we could develop statistical collision models with better predictive power. This was because the methodology enabled us to examine the relationship between the dependent variable and each explanatory variable individually and to find functional forms for those relationships, which are more likely to be non-linear.

QSO Selections Using Time Variability and Machine Learning

  • 김대원;;변용익
    • 천문학회보 / Vol. 36, No. 2 / p. 64 / 2011
  • We present a new quasi-stellar object (QSO) selection algorithm using a Support Vector Machine, a supervised classification method, on a set of extracted time series features including period, amplitude, color, and autocorrelation value. We train a model that separates QSOs from variable stars, non-variable stars, and microlensing events using 58 known QSOs, 1629 variable stars, and 4288 non-variables in the MAssive Compact Halo Object (MACHO) database as a training set. To estimate the efficiency and the accuracy of the model, we perform a cross-validation test using the training set. The test shows that the model correctly identifies ~80% of known QSOs with a 25% false-positive rate. The majority of the false positives are Be stars. We applied the trained model to the MACHO Large Magellanic Cloud (LMC) data set, which consists of 40 million lightcurves, and found 1620 QSO candidates. During the selection, none of the 33,242 known MACHO variables were misclassified as QSO candidates. In order to estimate the true false-positive rate, we crossmatched the candidates with astronomical catalogs including the Spitzer Surveying the Agents of a Galaxy's Evolution (SAGE) LMC catalog and a few X-ray catalogs. The results further suggest that the majority of the candidates, more than 70%, are QSOs.


Probabilistic penalized principal component analysis

  • Park, Chongsun;Wang, Morgan C.;Mo, Eun Bi
    • Communications for Statistical Applications and Methods / Vol. 24, No. 2 / pp. 143-154 / 2017
  • A variable selection method based on probabilistic principal component analysis (PCA) using a penalized likelihood method is proposed. The proposed method is a two-step variable reduction method. The first step uses the probabilistic principal component idea to identify principal components. The penalty function is then used to identify important variables in each component. We then build a model on the original data space instead of on the rotated data space through latent variables (principal components), because the proposed method achieves dimension reduction by identifying important observed variables. Consequently, the proposed method is of more practical use. The proposed estimators perform as well as the oracle procedure and are root-n consistent with a proper choice of regularization parameters. The proposed method can be successfully applied to high-dimensional PCA problems in which a relatively large portion of irrelevant variables is included in the data set. It is straightforward to extend our likelihood method to handle problems with missing observations using EM algorithms. Further, it could be effectively applied in cases where some data vectors exhibit one or more values missing at random.

Association-based Unsupervised Feature Selection for High-dimensional Categorical Data

  • 이창기;정욱
    • 품질경영학회지 / Vol. 47, No. 3 / pp. 537-552 / 2019
  • Purpose: The development of information technology makes it easy to utilize high-dimensional categorical data. In this regard, the purpose of this study is to propose a novel method for selecting the proper categorical variables in high-dimensional categorical data. Methods: The proposed feature selection method consists of three steps: (1) The first step defines the goodness-to-pick measure. In this paper, a categorical variable is relevant if it has relationships with other variables. Following this definition, the goodness-to-pick measure calculates the normalized conditional entropy with other variables. (2) The second step finds the relevant feature subset within the original variable set. This step decides whether a variable is relevant or not. (3) The third step eliminates redundant variables from the relevant feature subset. Results: Our experimental results showed that the proposed feature selection method generally yielded better classification performance than using no feature selection in high-dimensional categorical data, especially as the number of irrelevant categorical variables increases. In addition, as the number of irrelevant categorical variables with imbalanced categorical values increases, the difference in accuracy between the proposed method and the existing methods being compared grows. Conclusion: According to the experimental results, we confirmed that the proposed method consistently produces high classification accuracy in high-dimensional categorical data. Therefore, the proposed method is promising for effective use in high-dimensional situations.
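The abstract names normalized conditional entropy as the basis of the goodness-to-pick measure but does not give the formula. The sketch below therefore assumes one common normalization, H(X | Y) / H(X), which is 0 when Y fully determines X and 1 when the two variables are empirically independent. The function names are illustrative, and this is not necessarily the paper's exact definition.

```python
from collections import Counter
from math import log2


def entropy(xs):
    """Empirical Shannon entropy (in bits) of a sequence of categories."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())


def conditional_entropy(xs, ys):
    """H(X | Y) computed as H(X, Y) - H(Y)."""
    return entropy(list(zip(xs, ys))) - entropy(ys)


def normalized_conditional_entropy(xs, ys):
    """H(X | Y) / H(X); assumed normalization, defined as 0 if H(X) = 0."""
    hx = entropy(xs)
    if hx == 0.0:
        return 0.0
    return conditional_entropy(xs, ys) / hx


x = ["a", "a", "b", "b"]
y_dependent = ["u", "u", "v", "v"]    # y determines x exactly
y_independent = ["u", "v", "u", "v"]  # y carries no information about x

print(normalized_conditional_entropy(x, y_dependent))
print(normalized_conditional_entropy(x, y_independent))
```

Under this assumed normalization, a variable whose value is low against many other variables is strongly associated with them, which matches the abstract's notion of a relevant variable.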

Effect of Hotel Michelin Restaurant's Selection Attributes on Customer Behavioral Intention - Focused on Moderating Role of the Hotel Brand Image -

  • 양동휘;임종우
    • 한국콘텐츠학회논문지 / Vol. 21, No. 9 / pp. 322-332 / 2021
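  • This study sought to identify how the selection attributes of hotel Michelin Guide restaurants affect customer behavioral intention, and whether hotel brand image has a moderating effect when introduced as a variable. Convenience sampling was used, targeting customers who had visited Michelin restaurants located in hotels in the Seoul area, and the survey ran for about 60 days beginning July 1, 2020. The questionnaire, constructed from prior studies, was distributed to customers who had visited Michelin restaurants operated under consignment within hotels in the Seoul area, and the 287 valid responses collected were analyzed statistically using SPSS 22.0. The empirical analysis showed that, among the selection attribute factors, physical environment, food quality, service quality, and convenience had a significant positive (+) effect on customer behavioral intention, whereas price fairness had no effect. Finally, introducing interaction variables between the selection attributes and customer behavioral intention revealed a moderating effect for the physical environment and service quality variables.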

The Relationship between Hospital Selection by Employer and Disabilities in Occupational Accidents in Korea

  • Ahn, Joonho;Jang, Min;Yoo, Hyoungseob;Kim, Hyoung-Ryoul
    • Safety and Health at Work / Vol. 13, No. 3 / pp. 279-285 / 2022
  • Background: In the event of an industrial accident, the appropriate choice of hospital is important for worker health and prognosis. This study investigates whether the choice of hospital by the employer in the case of industrial accidents affects the prognosis of injured employees. Methods: Data from the 2018 Panel Study of Workers' Compensation Insurance in Korea were used in an unmatched case-control study. The exposure variable is "hospital selection by an employer," and the outcome variable is "worker's disability." Odds ratios (ORs) were estimated by modified Poisson regression and adjusted for age, gender, underlying disease, injury severity, and workplace size and stratified by industrial classification. The group at increased risk was analyzed and stratified by age, gender, and area. Results: In the construction industry, hospital selection by the employer was significantly associated with increased risk of disability (adjusted OR 1.26; 95% confidence interval [CI]: 1.20-1.32) and severe disability (adjusted OR 1.38; 95% CI: 1.08-1.76) among the injured. Female and younger workers not living in the Seoul capital area were more at risk of disability and severe disability than those living in the Seoul capital area. Conclusions: Hospital selection by employers affects the prognosis of workers injured in an industrial accident. To protect workers' health and safety, workplace emergency medical systems should be improved, and the selection of appropriate hospitals to supply treatment should be reviewed.

A Design of Variable Rate Clock and Data Recovery Circuit for Biomedical Silicon Bead

  • 조성훈;이동수;박형구;이강윤
    • 한국산업정보학회논문지 / Vol. 20, No. 4 / pp. 39-45 / 2015
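  • This paper presents the design of a variable-rate clock and data recovery (CDR) circuit using a blind oversampling technique. A CDR circuit basically consists of a clock recovery circuit and a data recovery circuit. The clock recovery circuit is implemented by combining a wide-range voltage-controlled oscillator (VCO) with a band selection technique, and the data recovery circuit is proposed as a digital circuit based on majority voting, which keeps power consumption and area small. Implementing the wide-range VCO and the data recovery circuit digitally made a low-power variable-rate CDR circuit possible. The designed circuit operates over a range of about 10 bps to 2 Mbps. Total power consumption is about 4.4 mW at a 1 MHz clock. The supply voltage is 1.2 V, the fabricated core area is $120{\mu}m{\times}75{\mu}m$, and the circuit was fabricated in a $0.13{\mu}m$ CMOS process.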
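The majority-voting idea behind the data recovery stage can be illustrated in software. The sketch below is a behavioral model only, not the paper's circuit: it assumes a 3x oversampling ratio and assumes that the sample groups are already aligned to bit boundaries, whereas a real blind oversampling CDR must also choose the sampling phase. The function name and sample data are hypothetical.

```python
def majority_vote_recover(samples, oversampling=3):
    """Recover one bit per group of `oversampling` consecutive samples.

    Each bit is decided by the majority value within its group, so a
    single corrupted sample per group does not flip the recovered bit.
    Assumes groups are already aligned to bit boundaries.
    """
    if oversampling % 2 == 0:
        raise ValueError("use an odd oversampling ratio to avoid ties")
    bits = []
    # Step through complete groups only; a trailing partial group is ignored.
    for i in range(0, len(samples) - oversampling + 1, oversampling):
        group = samples[i:i + oversampling]
        bits.append(1 if sum(group) > oversampling // 2 else 0)
    return bits


# Hypothetical oversampled stream for the bit sequence 1, 0, 1, 0,
# with one erroneous sample in the first, third, and fourth groups.
samples = [1, 1, 0,  0, 0, 0,  1, 0, 1,  0, 1, 0]
print(majority_vote_recover(samples))
```

An odd ratio is required here so that every group has a strict majority; the ratio used in the actual design is not stated in the abstract.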

A Survival Prediction Model of Rats in Uncontrolled Acute Hemorrhagic Shock Using the Random Forest Classifier

  • 최준열;김성권;구정모;김덕원
    • 대한의용생체공학회:의공학회지 / Vol. 33, No. 3 / pp. 148-154 / 2012
  • Hemorrhagic shock is a primary cause of injury-related death worldwide. Although many studies have tried to accurately diagnose hemorrhagic shock at an early stage, such attempts have not been successful because of compensatory mechanisms in humans. The objective of this study was to construct a survival prediction model for rats in acute hemorrhagic shock using a random forest (RF) model. Heart rate (HR), mean arterial pressure (MAP), respiration rate (RR), lactate concentration (LC), and peripheral perfusion (PP) measured in rats were used as input variables for the RF model, and its performance was compared with that of a logistic regression (LR) model. Before constructing the models, we performed 5-fold cross validation for RF variable selection and forward stepwise variable selection for the LR model to examine which variables were important for the models. For the LR model, sensitivity, specificity, accuracy, and area under the receiver operating characteristic curve (ROC-AUC) were 0.83, 0.95, 0.88, and 0.96, respectively. For the RF model, sensitivity, specificity, accuracy, and AUC were 0.97, 0.95, 0.96, and 0.99, respectively. In conclusion, the RF model was superior to the LR model for survival prediction in the rat model.
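For readers comparing the reported figures, the sketch below shows how sensitivity, specificity, and accuracy follow from a binary confusion matrix. The labels are toy values made up for illustration and are not the study's data; coding 1 as the positive class is an assumption, and AUC is omitted because it requires predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, and accuracy for binary labels.

    Labels are 0/1, with 1 treated as the positive class (assumption).
    Both classes must be present in `y_true`.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Toy labels for illustration only (not from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(classification_metrics(y_true, y_pred))
```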