• Title/Summary/Keyword: Categorical variable model (범주형 변수모형)


Trimmed LAD Estimators for Multidimensional Contingency Tables (분할표 분석을 위한 절사 LAD 추정량과 최적 절사율 결정)

  • Choi, Hyun-Jip
    • The Korean Journal of Applied Statistics / v.23 no.6 / pp.1235-1243 / 2010
  • This study proposes trimmed LAD (least absolute deviation) estimators for multidimensional contingency tables and suggests an algorithm to compute them. In addition, a method to determine the trimming quantity of the estimators is suggested. A Monte Carlo study shows that the proposed method yields a better trimming rate and coverage rate than the previously suggested method based on the determinant of the covariance matrix.

The guideline for choosing the right-size of tree for boosting algorithm (부스팅 트리에서 적정 트리사이즈의 선택에 관한 연구)

  • Kim, Ah-Hyoun;Kim, Ji-Hyun;Kim, Hyun-Joong
    • Journal of the Korean Data and Information Science Society / v.23 no.5 / pp.949-959 / 2012
  • This article seeks the decision-tree size that gives the best performance for the boosting algorithm. First, we define the tree size D as the depth of a decision tree. We then compare the performance of the boosting algorithm across different tree sizes experimentally. Although it is usual practice to keep the tree size in boosting small, we find that the choice of D has a significant influence on the performance of the boosting algorithm. Furthermore, we find that D needs to be sufficiently large for some datasets. The experimental results show that an optimal D exists for each dataset and that choosing the right D is important for improving the performance of boosting. We also build a model for estimating the right D for the boosting algorithm, using variables that describe the nature of a given dataset. The suggested model indicates that the optimal tree size D for a given dataset can be estimated from the error rate of a stump tree, the number of classes, the depth of a single tree, and the Gini impurity.
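
The abstract above lists the Gini impurity among the dataset descriptors used to estimate D, but does not say how it was computed. As a minimal sketch, assuming the standard definition 1 - sum(p_k^2) over class proportions p_k (the function name and the toy labels are illustrative, not taken from the paper):

```python
from collections import Counter


def gini_impurity(labels):
    """Return the Gini impurity 1 - sum(p_k^2) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention for an empty node (assumption)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


# Hypothetical label sets: a pure node and a perfectly balanced two-class node.
pure = gini_impurity(["a", "a", "a", "a"])
balanced = gini_impurity(["a", "a", "b", "b"])
```

A pure node has impurity zero, and the value grows as the classes become more evenly mixed, which is why it can serve as a rough descriptor of how hard a dataset is to separate.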

Categorical Prediction and Improvement Plan of Snow Damage Estimation using Random Forest (랜덤포레스트를 이용한 대설피해액에 대한 범주형 예측 및 개선방안 검토)

  • Lee, Hyeong Joo;Chung, Gunhui
    • Journal of Wetlands Research / v.21 no.2 / pp.157-162 / 2019
  • Recently, unusually heavy snow and cold spells have been occurring more often due to global climate change. In particular, the temperature dropped to minus 69 degrees Celsius in the United States on January 8, 2018. In Korea, on February 17, 2014, the auditorium building at the Gyeongju Mauna Resort collapsed under heavy snowfall. Since that tragic accident, many studies on reducing snow damage have been conducted, but it is difficult to predict the damage accurately because historical damage data are scarce and meteorological data are uncertain owing to the long distance between the damaged areas and the observatories. Therefore, in this study, available data were collected on factors thought to be related to snow damage, and the amount of snow damage was estimated categorically using a random forest. At present, the prediction accuracy is not sufficient because of the lack of historical damage data and changes in the design code for greenhouses. However, if accurate weather data are obtained for the affected areas, the accuracy of the estimates would increase enough for them to be used in setting the level of preparedness for disaster management.

Landslide Risk Assessment in Inje Using Logistic Regression Model (로지스틱 회귀분석을 이용한 인제군 산사태지역의 위험도 평가)

  • Lee, Hwan-Gil;Kim, Gi-Hong
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.30 no.3 / pp.313-321 / 2012
  • Korea has been continuously affected by landslides, as 70% of its land is mountainous and most of the annual rainfall is concentrated between June and September. Recently, abrupt climate change has increased the occurrence of landslides. The Gangwon region suffers especially from landslide damage because most of it is mountainous and steep, with shallow soil. In this study, a landslide risk assessment model was developed by applying logistic regression to various data for Duksan-ri, Inje-eup, Inje-gun, Gangwon-do, which suffered massive landslides triggered by heavy rain in July 2006. Information collected from field investigation and from aerial photos taken right after the landslides in the study area was stored in a GIS database for analysis. Slope gradient was entered in two ways: as a categorical variable and as a linear variable. An error matrix was constructed for each case, and the developed models showed classification accuracies of 81.4% and 81.9%, respectively.

Mixed-effects zero-inflated Poisson regression for analyzing the spread of COVID-19 in Daejeon (혼합효과 영과잉 포아송 회귀모형을 이용한 대전광역시 코로나 발생 동향 분석)

  • Kim, Gwanghee;Lee, Eunjee
    • The Korean Journal of Applied Statistics / v.34 no.3 / pp.375-388 / 2021
  • This paper aims to help prevent the spread of COVID-19 by analyzing confirmed cases of COVID-19 in Daejeon. A high volume of visitors, downtown areas, and psychological fatigue with prolonged social distancing were considered as risk factors associated with the spread of COVID-19. We considered the weekly confirmed cases in each administrative district as the response variable. Explanatory variables were the number of passengers getting off at bus stations in each administrative district and the elapsed time since the Korean government had imposed distancing in daily life. We employed a mixed-effects zero-inflated Poisson regression model because the number of cases was measured repeatedly and contained excess zero counts. We conducted k-means clustering to identify three groups of administrative districts with different characteristics in terms of the number of bars, the population size, and the distance to the closest college. Considering that the number of confirmed cases might vary depending on districts' characteristics, the clustering information was incorporated as a categorical explanatory variable. We found that COVID-19 was more prevalent in districts with larger populations and in downtown districts. As the number of passengers getting off in a downtown district increased, the confirmed cases increased significantly.
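
To make the "excess zero-count" motivation concrete, the sketch below gives the textbook zero-inflated Poisson probability mass function: with probability `pi` a count is a structural zero, otherwise it follows a Poisson distribution with mean `lam`. It is a minimal illustration only; it omits the random effects and covariates of the mixed-effects model described in the abstract, and the function names and parameter values are hypothetical.

```python
import math


def zip_pmf(k, lam, pi):
    """P(Y = k) under a zero-inflated Poisson with rate lam and zero-inflation pi."""
    if k < 0:
        return 0.0
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # Zeros come from two sources: structural zeros and Poisson zeros.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def zip_mean(lam, pi):
    """Expected value of the zero-inflated Poisson, (1 - pi) * lam."""
    return (1.0 - pi) * lam


# Hypothetical parameters: weekly mean of 2 cases, 30% structural zeros.
p_zero_zip = zip_pmf(0, lam=2.0, pi=0.3)
p_zero_poisson = math.exp(-2.0)
```

Comparing `p_zero_zip` with `p_zero_poisson` shows how the inflation term raises the probability of a zero count above what a plain Poisson model with the same rate would allow.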

Long-term Streamflow Prediction for Integrated Real-time Water Management System (통합실시간 물관리 운영시스템을 위한 장기유량예측)

  • Kang Boosik;Rieu Seung Yup;Ko Ick-Hwan
    • Proceedings of the Korea Water Resources Association Conference / 2005.05b / pp.1450-1454 / 2005
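  • In water resources management, streamflow prediction for future time horizons is one of the most important factors, with a decisive influence on the decisions of water resources system operators. Water-use activities such as efficient water allocation and power generation require long-term streamflow prediction at a monthly scale or longer, which in turn requires a rainfall forecast. This study presents a methodology for mid- to long-term streamflow prediction for an integrated real-time water management system. One of the representative methods for mid- to long-term streamflow prediction is Ensemble Streamflow Prediction (ESP). ESP uses the current basin state as the initial condition and feeds ensembles of historical time series, such as temperature and precipitation, into a rainfall-runoff model to predict runoff. ESP thus combines the current basin state, the historical rainfall observations in the basin, and information on future rainfall forecasts to produce a corresponding runoff ensemble. The probability distribution of the runoff ensemble varies with the weight assigned to each ensemble trace and, in some cases, carries over to the probability distributions of variables derived secondarily from streamflow. Existing ESP theory is based on the categorical probability forecasts of the US NWS, which has made it difficult to apply directly in the Korean setting. This study therefore uses the monthly precipitation outlook of the Korea Meteorological Administration and presents an ESP technique suited to the characteristics of that information. Furthermore, to construct daily-scale monthly precipitation scenarios for mid- to long-term water resources operation, it presents a technique that applies ESP by combining numerical weather prediction with the monthly precipitation outlook.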

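
The statement that the runoff ensemble's probability distribution depends on the weight given to each trace can be illustrated with a small sketch. It assumes each trace has already been reduced to a single forecast quantity (for example, a monthly flow volume) and that the weights are non-negative; how the weights would be derived from a precipitation outlook is not shown, and all names and numbers are hypothetical.

```python
def weighted_mean(traces, weights):
    """Weighted mean of per-trace forecast values."""
    total = sum(weights)
    return sum(q * w for q, w in zip(traces, weights)) / total


def exceedance_probability(traces, weights, threshold):
    """Weighted probability that the forecast value exceeds a threshold."""
    total = sum(weights)
    return sum(w for q, w in zip(traces, weights) if q > threshold) / total


# Hypothetical ensemble: four traces, one per historical year used as input.
traces = [120.0, 80.0, 200.0, 150.0]
equal_weights = [1.0, 1.0, 1.0, 1.0]
# Hypothetical weights favouring the drier traces.
dry_weights = [0.1, 0.4, 0.2, 0.3]

p_equal = exceedance_probability(traces, equal_weights, threshold=100.0)
p_dry = exceedance_probability(traces, dry_weights, threshold=100.0)
```

With equal weights every historical year counts the same; shifting weight toward drier traces lowers the probability of exceeding the threshold without changing the traces themselves.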

Multivariate Analysis for Clinicians (임상의를 위한 다변량 분석의 실제)

  • Oh, Joo Han;Chung, Seok Won
    • Clinics in Shoulder and Elbow / v.16 no.1 / pp.63-72 / 2013
  • In medical research, multivariate analysis, especially multiple regression analysis, is used to analyze the influence of multiple variables on the result. In multiple regression analysis, one must consider which variables to include in the model and the problem of multicollinearity that arises when there are many variables, as well as the basic assumptions of regression analysis. The fit of the multiple regression model is expressed by the coefficient of determination, $R^2$, and the influence of each independent variable on the result by a regression coefficient, ${\beta}$. Multiple regression analysis can be divided into multiple linear regression analysis, multiple logistic regression analysis, and Cox regression analysis according to the type of dependent variable (continuous variable, categorical variable (binary logit), and state variable, respectively), and the influence of variables on the result is evaluated by the regression coefficient ${\beta}$, the odds ratio, and the hazard ratio, respectively. Knowledge of multivariate analysis enables clinicians to analyze results accurately and to design further research efficiently.
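
As a brief illustration of how the coefficient ${\beta}$ relates to the odds ratio mentioned above: in logistic regression the odds ratio for a one-unit increase in a predictor is exp(β), and in Cox regression the hazard ratio is likewise exp(β). The sketch below assumes a fitted coefficient and its standard error are already available; the numeric values are hypothetical, and the Wald-type interval is one common choice rather than something stated in the abstract.

```python
import math


def odds_ratio(beta):
    """Odds ratio for a one-unit increase in a predictor with coefficient beta."""
    return math.exp(beta)


def odds_ratio_ci(beta, se, z=1.96):
    """Approximate 95% Wald confidence interval for the odds ratio."""
    return math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical fitted values for a single binary predictor.
beta_hat = 0.693
se_hat = 0.25

or_hat = odds_ratio(beta_hat)
ci_low, ci_high = odds_ratio_ci(beta_hat, se_hat)
```

A coefficient of zero corresponds to an odds ratio of one (no association), so an interval that excludes one indicates a statistically detectable effect at the chosen level.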

A Study on the Characteristics of Cyanobacteria in the Downstream of Nakdong River Considering the Meteorological Effects (기상학적 영향을 고려한 낙동강 하류 녹조 발생특성 연구)

  • Jung, Woo Suk;Kim, Young Do;Kim, Sung Eun;Ki, Seo Jin
    • Proceedings of the Korea Water Resources Association Conference / 2020.06a / pp.110-110 / 2020
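  • Recently, in the Nakdong River basin, major algae alerts have been issued under the influence of summer heat waves and drought, and the water-quality environment is changing rapidly. In the Nakdong River, the study area, drought likewise caused algal blooms and algae alerts were issued. Massive cyanobacterial blooms cause problems such as oxygen depletion and increased organic matter in the water body as the cells proliferate and die off. The toxins secreted by cyanobacteria are also harmful to aquatic ecosystems and to humans. In addition, cyanobacteria secrete geosmin and 2-MIB, odorous compounds that have been found harmless to humans but cause unpleasant odors such as an earthy smell in tap water, which adversely affects the drinking water supply system. The Nakdong River, the study site, shows the characteristics of a stagnant water body, as the construction of multi-function weirs has increased water depth and slowed flow velocity. This means that the river is taking on lake-like characteristics while undergoing water-quality changes such as algal blooms. In this study, the occurrence of cyanobacteria in the lower Nakdong River was analyzed through visualization, and the main factors influencing cyanobacterial occurrence at each site were derived using a random forest. Under the algae alert system, occurrence levels are classified by the issuance criteria into Attention (관심), Danger (위험), and Mass Bloom (대발생). In the training data, observations with a cyanobacterial cell count below 1,000 cells/mL, the threshold for the Attention level, were treated as below-Attention data and labeled as the Normal class. The resulting occurrence classes were set as a categorical variable; a model was built on the training data, and its accuracy was evaluated on validation data. Through this study, the main factors influencing algal occurrence were derived, and the site-specific characteristics of algal blooms were compared by evaluating the importance of each variable.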


Performance Evaluation of Military Corps with Categorical Environmental Variables (범주형 환경변수를 고려한 부대성과평가 방법에 관한 연구 - DEA와 CCCA의 결합을 중심으로 -)

  • Lee, Kyung-Won;Park, Myung-Seop;Im, Jae-Poong
    • Journal of the military operations research society of Korea / v.32 no.1 / pp.51-72 / 2006
  • On many occasions, the performance of a corps is influenced not only by its own efforts but also by the commander of the next higher unit in a vertical organizational structure. When the direction given by the commander of the next higher organization differs from that of the actual evaluation agency, the unit under evaluation may be rated lower than it deserves. This study suggests an alternative method for evaluating the performance of military units when there are critical environmental factors that affect performance. The method employs DEA, a nonparametric method, and Constrained Canonical Correlation Analysis (CCCA), a parametric method used to estimate an efficient frontier with multiple dependent variables and constraints. This article also uses a set of categorical environmental variables in the CCCA to improve the fairness of performance evaluation. It is shown that introducing the categorical variables helps in evaluating the true performance of individual units, such as battalions subordinate to different next higher commanders.

Denoising Self-Attention Network for Mixed-type Data Imputation (혼합형 데이터 보간을 위한 디노이징 셀프 어텐션 네트워크)

  • Lee, Do-Hoon;Kim, Han-Joon;Chun, Joonghoon
    • The Journal of the Korea Contents Association / v.21 no.11 / pp.135-144 / 2021
  • Recently, data-driven decision-making has become a key technology leading the data industry, and the machine learning behind it requires high-quality training datasets. However, real-world data contain missing values for various reasons, which degrades the performance of prediction models learned from the poor training data. Therefore, in order to build high-performance models from real-world datasets, many studies on automatically imputing missing values in initial training data have been conducted. Many conventional machine learning-based imputation techniques for handling missing data involve very time-consuming and cumbersome work because they apply only to numeric columns or create an individual predictive model for each column. Therefore, this paper proposes a new data imputation technique called 'Denoising Self-Attention Network (DSAN)', which can be applied to mixed-type datasets containing both numerical and categorical columns. DSAN can learn robust feature representation vectors by combining self-attention and denoising techniques, and can automatically impute multiple missing variables in parallel through multi-task learning. To verify the validity of the proposed technique, data imputation experiments were performed after arbitrarily generating missing values in several mixed-type training datasets. We then show the validity of the proposed technique by comparing the performance of binary classification models trained on the imputed data, together with the errors between the original and imputed values.