• Title/Summary/Keyword: 과소표본추출 (undersampling)


A Study on Injury Severity Prediction for Car-to-Car Traffic Accidents (차대차 교통사고에 대한 상해 심각도 예측 연구)

  • Ko, Changwan;Kim, Hyeonmin;Jeong, Young-Seon;Kim, Jaehee
    • The Journal of The Korea Institute of Intelligent Transport Systems / v.19 no.4 / pp.13-29 / 2020
  • Automobiles have long been an essential part of daily life, but the social costs of car traffic accidents exceed 9% of Korea's national budget. Hence, it is necessary to establish a prevention and response system for car traffic accidents. To present a model that can classify and predict the degree of injury in car traffic accidents, we used the big data analysis techniques of K-nearest neighbor, logistic regression, naive Bayes classifier, decision tree, and ensemble algorithms. The performance of the models was analyzed using data on nationwide traffic accidents over the past three years. In particular, considering the difference in the number of records across injury severity levels, we applied down-sampling to the groups with many samples to enhance the classification accuracy of the models, and then verified the statistical significance of the models using ANOVA.
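The down-sampling step described in this abstract can be illustrated with a minimal sketch. It assumes the simplest variant, random down-sampling of every class to the size of the smallest class; the abstract does not say which variant the authors used. The function name `downsample`, the toy records, and the class labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def downsample(records, label_of, seed=0):
    """Randomly down-sample every class to the size of the smallest class.

    This is an assumed, simplest-case variant, not the authors' procedure.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)
    # Size of the rarest class sets the target for all classes.
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: (record id, severity) pairs with a heavy imbalance.
records = (
    [(i, "minor") for i in range(900)]
    + [(i, "serious") for i in range(90)]
    + [(i, "fatal") for i in range(10)]
)
balanced = downsample(records, label_of=lambda r: r[1])
counts = Counter(label for _, label in balanced)
print(counts)
```

Down-sampling discards majority-class records, so in practice it is usually applied to the training split only, leaving the evaluation data at its natural class proportions.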

Reliability of self-reported data for prevalence and health life expectancy studies: comparison with sample cohort DB of National Health Insurance Services (자가 응답식 자료에 근거한 유병률 및 건강기대수명 연구의 신뢰도 분석: 건강보험 표본코호트 DB와의 비교)

  • Kwon, Tae Yeon;Park, Yousung
    • The Korean Journal of Applied Statistics / v.29 no.7 / pp.1329-1346 / 2016
  • Korea Health Panel (KHP) data and Korea National Health and Nutrition Examination Survey (KNHANES) data are collected through self-assessment and self-report of individuals' health status and medical use. Previous studies have claimed that the reliability of prevalence rates and health life expectancies obtained from these data should be validated. National Health Insurance Services in Korea recently released a sample cohort DB that contains all data related to the use of medical facilities for all Korean citizens. It has been shown that disease-specific prevalence rates calculated from these data are representative and reliable for the entire population. In this paper, we evaluate the reliability of prevalence rates derived from self-reported data such as KHP and KNHANES by comparing them with the prevalence rates from the sample cohort DB. We found that both KHP and KNHANES underestimate prevalence rates and in turn overestimate health life expectancies. Moreover, except in the sample cohort DB, the general trends of health life expectancies may be distorted by sampling and non-sampling errors.

Sample Distortion in Social Surveys and Effects of Weighting Adjustment: A Study of 18 Cases (사회조사에서 표본의 왜곡과 가중치 보정의 결과: 18개 사례연구)

  • Huh, Myung-Hoe;Yoon, Young-A;Lee, Yong-Goo
    • Survey Research / v.5 no.2 / pp.31-48 / 2004
  • We collected and analyzed 18 social surveys to assess the quality of the samples with respect to region, gender, age band, education level and occupation. We found that highly educated people and housewives are over-represented in our samples, whereas less educated people, self-employed/blue-collar workers and white-collar workers are under-represented. To correct such sample distortions, we applied iterative proportional weighting (raking) to our samples and observed sizable changes in the survey results. The effective sample sizes also shrank by up to 20%-40%, which can be interpreted as a need for larger samples to meet the claimed sampling error limits.
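Raking and the resulting loss of effective sample size can be sketched as follows. This is an illustrative sketch only: the sample, the two margins, and the target shares are hypothetical, and the effective sample size uses Kish's approximation (sum of weights squared over sum of squared weights), which is an assumption since the abstract does not state the formula used.

```python
def rake(rows, margins, max_iter=100, tol=1e-9):
    """Iterative proportional weighting (raking).

    rows    : list of dicts mapping variable name -> category
    margins : {variable: {category: target population share}}
    Returns one weight per row; the weights sum to len(rows).
    """
    n = len(rows)
    w = [1.0] * n
    for _ in range(max_iter):
        max_change = 0.0
        for var, targets in margins.items():
            # Current weighted total of each category of this variable.
            totals = {}
            for row, wi in zip(rows, w):
                totals[row[var]] = totals.get(row[var], 0.0) + wi
            # Scale each row so the category totals hit the targets.
            for i, row in enumerate(rows):
                factor = targets[row[var]] * n / totals[row[var]]
                max_change = max(max_change, abs(factor - 1.0))
                w[i] *= factor
        if max_change < tol:
            break
    return w


def weighted_share(rows, w, var, category):
    """Weighted share of rows falling in the given category."""
    return sum(wi for row, wi in zip(rows, w) if row[var] == category) / sum(w)


def effective_sample_size(w):
    """Kish's approximation (an assumption, see lead-in)."""
    return sum(w) ** 2 / sum(x * x for x in w)


# Hypothetical sample of 100 respondents, over-representing the highly educated.
rows = (
    [{"gender": "M", "edu": "high"}] * 30
    + [{"gender": "M", "edu": "low"}] * 20
    + [{"gender": "F", "edu": "high"}] * 40
    + [{"gender": "F", "edu": "low"}] * 10
)
# Hypothetical population shares.
margins = {
    "gender": {"M": 0.5, "F": 0.5},
    "edu": {"high": 0.4, "low": 0.6},
}
weights = rake(rows, margins)
n_eff = effective_sample_size(weights)
print(round(weighted_share(rows, weights, "edu", "high"), 3), round(n_eff, 1))
```

The further the raked weights spread from 1, the smaller `n_eff` becomes relative to the nominal sample size, which is the effect the abstract reports.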


Using the Sample IQR for Calculating Sample Size (표본크기 결정을 위한 IQR의 활용방법)

  • 홍종선;김현태;윤상호;정민정
    • The Korean Journal of Applied Statistics / v.16 no.1 / pp.181-193 / 2003
  • When no sample standard deviation is available as an estimator of the population standard deviation $\sigma$ in sample size computations, a function of the sample range (R) or the interquartile range (IQR) is often used as an estimator of $\sigma$. To avoid under-powered studies, these estimates must have a high probability of being greater than or equal to $\sigma$. In this paper, the probabilities of being greater than or equal to $\sigma$ are estimated for the IQR under various parent distributions and compared with the probabilities for R/4 (Browne 2001). Alternative divisors (K) are explored and discussed for which the probabilities of R/K and IQR/K being greater than or equal to $\sigma$ are at least 95%.
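The probabilities discussed in this abstract can be approximated by Monte Carlo simulation. The sketch below assumes a normal parent distribution, a sample size of 30, and a divisor of 1.35 for the IQR (the approximate IQR-to-sigma ratio of a normal distribution); all three are illustrative choices, not values taken from the paper, and the function names are hypothetical.

```python
import random
import statistics


def prob_at_least_sigma(n, divisor, stat, sigma=1.0, reps=2000, seed=0):
    """Estimate P(stat(sample) / divisor >= sigma) for a normal parent."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        if stat(sample) / divisor >= sigma:
            hits += 1
    return hits / reps


def sample_range(xs):
    return max(xs) - min(xs)


def sample_iqr(xs):
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1


p_range = prob_at_least_sigma(n=30, divisor=4.0, stat=sample_range)
p_iqr = prob_at_least_sigma(n=30, divisor=1.35, stat=sample_iqr)
print(p_range, p_iqr)
```

Lowering the divisor K raises the probability that the estimate is at least $\sigma$, so a search over K for a 95% target can reuse `prob_at_least_sigma` with decreasing divisors.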

Systematic Bias of Telephone Surveys: Meta Analysis of 2007 Presidential Election Polls (전화조사의 체계적 편향 - 2007년 대통령선거 여론조사들에 대한 메타분석 -)

  • Kim, Se-Yong;Huh, Myung-Hoe
    • The Korean Journal of Applied Statistics / v.22 no.2 / pp.375-385 / 2009
  • In the 2007 Korean presidential election, most telephone polls indicated that Lee Myung-Bak led the runner-up, Jung Dong-Young, by a certain margin. The margin between the two candidates can be estimated accurately by averaging individual poll results, provided there is no systematic bias in telephone surveys. Most Korean telephone surveys drawn from telephone directories are based on quota samples, with region, gender and age band as quota variables. The surveys may therefore carry systematic bias due to unbalanced factors inherent in quota sampling. The aim of this study is to answer the following questions using the analytic methods adopted in Huh et al. (2004): Question 1. Was there systematic bias in the estimates of support rates? Question 2. If so, what was the source of the bias? To answer these questions, we collected eighteen surveys administered during the election campaign period and applied iterated proportional weighting (rim weighting) to the last eleven surveys to balance five factors: region, gender, age, occupation and education level. We found that the support rate of Lee Myung-Bak was consistently over-estimated by 1.4%P and that of Jung Dong-Young under-estimated by 0.6%P, resulting in an over-estimation of the margin by 2.0%P. By investigating the Lee Myung-Bak bias with logistic regression models, we conclude that it originated from the under-representation of the less educated class and/or the over-representation of housewives in telephone samples.
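The margin figure follows from the two support-rate biases by simple arithmetic. In the worked line below, the symbols $b_{L}$ and $b_{J}$ (bias in the estimated support rates of Lee Myung-Bak and Jung Dong-Young) are notation introduced here for illustration, not taken from the paper.

```latex
\operatorname{bias}(\text{margin}) = b_{L} - b_{J}
  = (+1.4) - (-0.6) = 2.0 \ \text{\%P}
```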

Weighting Effect on the Weighted Mean in Finite Population (유한모집단에서 가중평균에 포함된 가중치의 효과)

  • Kim, Kyu-Seong
    • Survey Research / v.7 no.2 / pp.53-69 / 2006
  • Weights can be constructed and applied at both the sample design stage and the analysis stage of a sample survey. Design-stage weights relate to sample data acquisition quantities such as the sample selection probability and the response rate, whereas analysis-stage weights are connected with external quantities such as population quantities and auxiliary information. The final weight is the product of all weights from both stages. In this paper, we focus on analysis-stage weights and investigate their effect on the weighted mean when estimating the population mean. We consider a finite population in which each unit has a fixed pair of survey value and weight, and assume equal selection probability designs. Under this condition we derive formulas for the bias and the mean square error of the weighted mean, and show that the weighted mean is biased and that the direction and amount of the bias are explained by the correlation between the survey variate and the weight: if the correlation coefficient is positive, the weighted mean over-estimates the population mean; if it is negative, the weighted mean under-estimates it. The magnitude of the bias also grows as the correlation coefficient grows. In addition to the theoretical derivation, we conduct a simulation study to quantify the bias and mean square error numerically. In the simulation, nine weights whose correlation coefficients with the survey variate range from -0.2 to 0.6 are generated, four sample sizes from 100 to 400 are considered, and the bias and mean square error are calculated in each case. As a result, for a sample size of 400 and a correlation coefficient of 0.55, the squared bias of the weighted mean accounts for up to 82% of the mean square error, which shows that the weighted mean may be seriously biased in some cases.
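The sign of the bias described in this abstract can be checked with a small simulation. This is a minimal sketch under assumed settings: the population size, the way weights are generated, and the sample size are hypothetical and do not reproduce the paper's simulation design.

```python
import random
import statistics


def make_population(size, slope, seed=0):
    """Finite population of fixed (y, w) pairs; slope sets the sign of corr(y, w)."""
    rng = random.Random(seed)
    pop = []
    for _ in range(size):
        y = rng.gauss(50.0, 10.0)
        # Weight depends linearly on y plus noise; floored to stay positive.
        w = max(0.1, 1.0 + slope * (y - 50.0) / 10.0 + rng.gauss(0.0, 0.2))
        pop.append((y, w))
    return pop


def weighted_mean(sample):
    return sum(w * y for y, w in sample) / sum(w for _, w in sample)


def simulated_bias(pop, n, reps=2000, seed=1):
    """Mean of weighted means over repeated equal-probability samples,
    minus the true population mean."""
    rng = random.Random(seed)
    true_mean = statistics.fmean(y for y, _ in pop)
    estimates = [weighted_mean(rng.sample(pop, n)) for _ in range(reps)]
    return statistics.fmean(estimates) - true_mean


bias_pos = simulated_bias(make_population(5000, slope=0.2), n=100)
bias_neg = simulated_bias(make_population(5000, slope=-0.2), n=100)
print(bias_pos, bias_neg)
```

With a positive `slope` the weights favor units with large survey values, so the weighted mean overshoots the population mean; a negative `slope` reverses the sign.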


Difference in Severity of Acute Rejection Grading between Superficial Cortex and Deep Cortex in Renal Allograft Biopsies

  • Lee, Su-Jin;Kim, Young-Ki;Kim, Kee-Hyuck
    • Childhood Kidney Diseases / v.11 no.2 / pp.152-160 / 2007
  • Twenty-six renal allograft biopsies that showed acute rejection and had renal capsule and medulla in the same specimen were selected in order to compare the severity of acute rejection between the superficial cortex, deep cortex and medulla. Disregarding the mid-cortical region, the superficial cortex was defined as the one-third of the distance from the renal capsule to the medulla nearest the capsule, and the deep cortex as the one-third of the cortex adjacent to the medulla. Using semiquantitative histologic analysis, the following parameters were compared in the superficial cortex, deep cortex, and medulla: interstitial inflammation, edema, tubulitis, and acute tubulointerstitial rejection grade. The presence of lymphocyte activation and polymorphonuclear leukocytes was also evaluated. Significantly greater histologic changes of acute rejection were found in the deep cortex than in the superficial cortex for the following parameters: interstitial inflammation (P=0.013), edema (P=0.023) and tubulointerstitial rejection grade (P=0.016). These findings support the view that biopsies that do not include deep cortex may underestimate the severity of renal allograft rejection.


Friedewald-Estimated Versus Directly Measured LDL-Cholesterol: KNHANES 2009-2010 (LDL-콜레스테롤의 Friedewald 계산값과 실측값 비교: 국민건강영양조사 2009-2010)

  • Jang, Sungok;Lee, Jongseok
    • Journal of the Korea Academia-Industrial cooperation Society / v.16 no.8 / pp.5492-5500 / 2015
  • Low-density lipoprotein cholesterol (LDL-C) is a major modifiable risk factor for cardio-cerebrovascular disease. In clinical practice, however, it is primarily calculated using the Friedewald formula as a cost-effective method. The aim of this study was to compare Friedewald-estimated and directly measured LDL-C values and to assess the concordance in guideline LDL-C risk classification between the two methods. The data were derived from the 2009 and 2010 Korea National Health and Nutrition Examination Survey (KNHANES). Analysis was done for 4,319 subjects with lipid panels: total cholesterol (TC), high-density lipoprotein cholesterol (HDL-C), LDL-C directly measured using an enzymatic homogeneous assay, and triglycerides (TG). For subjects with TG lower than 400 mg/dL, Friedewald-estimated and directly measured LDL-C were highly correlated (r = 0.958, p < 0.001) and overall concordance was 82.7%. As TG increased, overall concordance decreased: it was 85.4% at TG lower than 150 mg/dL, 78.2% at TG of 150-199 mg/dL, and 71.4% at TG of 200-399 mg/dL. The Friedewald equation tended to overestimate LDL-C when TG was < 150 mg/dL but to underestimate LDL-C when TG was $\geq 150$ mg/dL. As a result, Friedewald estimation misclassified 382 subjects (9.1%) into a higher category versus 348 subjects (8.3%) into a lower category. Our findings suggest that overestimation of LDL-C by the Friedewald formula can be as serious a problem as underestimation.
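The Friedewald estimate and the concordance measure can be sketched as below. The formula is the standard one in mg/dL (LDL-C = TC - HDL-C - TG/5). The category cut points are an assumption: the abstract does not list the guideline thresholds, so commonly cited values of 100, 130, 160 and 190 mg/dL are used here. The subject data are hypothetical.

```python
import bisect

# Assumed LDL-C category boundaries in mg/dL (not stated in the abstract).
ASSUMED_CUTS = (100, 130, 160, 190)


def friedewald_ldl(tc, hdl, tg):
    """Friedewald estimate of LDL-C in mg/dL; restricted to TG < 400 mg/dL."""
    if tg >= 400:
        raise ValueError("Friedewald estimate is not used when TG >= 400 mg/dL")
    return tc - hdl - tg / 5.0


def risk_category(ldl, cuts=ASSUMED_CUTS):
    """Index of the LDL-C category: 0 for < 100, 1 for 100-129, and so on."""
    return bisect.bisect_right(cuts, ldl)


def concordance(subjects):
    """Share of subjects placed in the same category by both methods.

    subjects: iterable of (tc, hdl, tg, directly_measured_ldl) tuples.
    """
    subjects = list(subjects)
    same = sum(
        risk_category(friedewald_ldl(tc, hdl, tg)) == risk_category(direct)
        for tc, hdl, tg, direct in subjects
    )
    return same / len(subjects)


# Hypothetical subjects: (TC, HDL-C, TG, directly measured LDL-C), all in mg/dL.
subjects = [
    (200, 50, 150, 118),  # estimate 120, same category as direct
    (240, 40, 300, 145),  # estimate 140, same category as direct
    (180, 60, 100, 95),   # estimate 100, one category above direct
]
agreement = concordance(subjects)
print(friedewald_ldl(200, 50, 150), agreement)
```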

Farm Size and Production Efficiency of Korean Rice Farms: An Application of a Ray-Homothetic Stochastic Production Function ("레이 동조 확률 생산함수"에 의한 경영규모별 미곡생산의 효율성 분석)

  • 강봉순;노재선
    • Journal of Korean Society of Rural Planning / v.1 no.1 / pp.99-110 / 1995
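  • This study examines the efficiency of Korean rice production by farm size and tests the hypothesis that efficiency can be raised by expanding farm scale. Obtaining the frontier production function, that is, the production function of the technically leading farms, calls for a stochastic model that can use the information in the disturbance term, and a ray-homothetic function is appropriate for identifying scale efficiency by farm size. Accordingly, using sample data on 1,203 farm households drawn at random from the 1992 rice production cost data of the Ministry of Agriculture, Forestry and Fisheries, we estimated a ray-homothetic stochastic production function, which accounts for both elements at once, by maximum likelihood estimation, and on that basis measured the inefficiency of rice production by farm size, decomposed into pure technical inefficiency and scale inefficiency. According to the estimates, the inefficiency of rice production averages 35.1%. Of this, pure technical inefficiency was 12.0% and scale inefficiency was 24.1%. Both technical and scale inefficiency decreased as cultivated area increased, so the hypothesis that the efficiency of rice production can rise with farm-size expansion was not rejected. However, scale inefficiency remained high even for large farms, which indicates that institutional barriers to farm-size expansion are still high. In addition, the efficiency gap between large and small farms was not pronounced, which shows that expanding farm scale through cultivated land alone has limits as a way to improve efficiency. The results have the following policy implications. First, the potential for raising the efficiency of Korean rice production should not be underestimated. Second, although farm-size expansion is needed to raise the efficiency of rice production, removing the institutional and technical obstacles to farm-size expansion matters more than simply enlarging cultivated area. Finally, it should not be overlooked that the dissemination of existing advanced farming techniques, as well as the development of new ones, can play a substantial role in raising the efficiency of rice production.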


An Evaluation of a Dasymetric Surface Model for Spatial Disaggregation of Zonal Population data (구역단위 인구자료의 공간적 세분화를 위한 밀도 구분적 표면모델에 대한 평가)

  • Jun, Byong-Woon
    • Journal of the Korean association of regional geographers / v.12 no.5 / pp.614-630 / 2006
  • Improved estimates of populations at risk, needed for quick and effective response to natural and man-made disasters, require spatial disaggregation of zonal population data because of the spatial mismatch in areal units between census zones and impact zones. This paper implements a dasymetric surface model that disaggregates the population of a census block group into populations associated with each constituent pixel, and evaluates the performance of this surface-based spatial disaggregation model visually and statistically. The model employed geographic information systems (GIS) so that dasymetric interpolation could be guided by satellite-derived land use and land cover data as additional information about the geographic distribution of population. In the spatial disaggregation, percent-cover-based empirical sampling and areal weighting techniques were used to objectively determine dasymetric weights for each grid cell. The dasymetric population surface for the Atlanta metropolitan area was generated by the surface-based spatial disaggregation model. The accuracy of the dasymetric population surface was tested against census counts using the root mean square error (RMSE) and an adjusted RMSE. The errors for each census tract and block group were also visualized with percent error maps. Results indicate that the dasymetric population surface provides high-precision population estimates as well as the detailed spatial distribution of population within census block groups. The results also show that the population surface tends to overestimate or underestimate population mainly in rural and forested areas and in urban core areas.
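The core of dasymetric allocation and the RMSE check can be sketched in a few lines. This is a simplified illustration: the land-cover classes, their relative weights, and the zone population are hypothetical, and the paper's empirical sampling step for deriving the weights and its adjusted RMSE are not reproduced here.

```python
import math


def dasymetric_allocate(zone_population, pixel_classes, class_weights):
    """Split one zone's population across its pixels in proportion to the
    weight of each pixel's land-cover class (the zone total is preserved)."""
    weights = [class_weights[c] for c in pixel_classes]
    total = sum(weights)
    if total == 0:
        raise ValueError("zone has no pixels with positive weight")
    return [zone_population * w / total for w in weights]


def rmse(estimated, observed):
    """Root mean square error between estimated and observed counts."""
    squared = [(e - o) ** 2 for e, o in zip(estimated, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical block group of 1,000 people covering four pixels.
class_weights = {"urban": 9, "forest": 2, "water": 0}  # assumed relative densities
pixel_classes = ["urban", "urban", "forest", "water"]
pixels = dasymetric_allocate(1000, pixel_classes, class_weights)
print(pixels, sum(pixels))
```

Accuracy can then be assessed by summing pixel estimates back into independent zones and comparing them with census counts via `rmse`.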
