• Title/Abstract/Keywords: Sampling

Search results: 12,637 items (processing time: 0.039 seconds)

Heterogeneous Ensemble of Classifiers from Under-Sampled and Over-Sampled Data for Imbalanced Data

  • Kang, Dae-Ki;Han, Min-gyu
    • International journal of advanced smart convergence / Vol. 8, No. 1 / pp. 75-81 / 2019
  • The data imbalance problem is common and causes serious problems in the machine learning process. Sampling is one of the effective methods for addressing it. Over-sampling increases the number of instances, so on imbalanced data it is applied to minority instances. Under-sampling reduces the number of instances and is usually performed on majority data. We apply under-sampling and over-sampling to imbalanced data to generate sampled data sets. From these sampled data sets and the original data set, we construct a heterogeneous ensemble of classifiers. We apply five different algorithms to the heterogeneous ensemble. Experimental results on an intrusion detection dataset, used as an imbalanced dataset, show that our approach is effective.
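The over-sampling and under-sampling steps described in this abstract can be sketched as follows. The abstract does not say which sampling algorithms the authors used, so plain random sampling is assumed here, and the function names and toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen instances until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))  # sample with replacement
            y_out.append(label)
    return X_out, y_out

def random_undersample(X, y, seed=0):
    """Randomly keep only as many instances per class as the smallest class has."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):  # sample without replacement
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out

# Toy imbalanced data (assumption): 8 majority (label 0) and 2 minority (label 1) instances.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
```

Each resampled set, together with the original data, could then serve as training data for one member of an ensemble, which is the general idea the abstract describes.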

출구조사를 위한 투표소 확률추출 방법 (Probability Sampling to Select Polling Places in Exit Poll)

  • 김영원;엄윤희
    • 한국조사연구학회지:조사연구 / Vol. 6, No. 2 / pp. 1-32 / 2005
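  • In exit polls, the method used to select polling places is a key factor in determining the poll's accuracy. This study proposes sorted systematic sampling as an alternative to representative-district sampling and analyzes its applicability and efficiency. We also consider how large the sampling error of the estimator is under the proposed sorted systematic sampling, and how to determine the sample size needed to meet a target error. Through an empirical analysis based on the vote counts of the 17th general election in 2004, we show that the proposed sorted systematic sampling is more efficient than the existing representative-district sampling in terms of mean prediction error, and we use the cluster effect to explain the errors that arise when sample size and estimation error are interpreted in existing exit polls. In addition, we derive the variance of the estimator obtained under the proposed sorted sampling and address the sample size determination problem using the concept of the design effect.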

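As a rough illustration of the sorted systematic sampling named in this entry, the sketch below sorts a frame by an auxiliary variable and then selects units at a fixed interval from a random start. The frame, the `prior_share` sorting variable, and the function name are assumptions made for illustration; the paper's actual sorting variables are not given in the abstract.

```python
import random

def sorted_systematic_sample(units, key, n, seed=0):
    """Sort the frame by `key`, then take n units at a fixed interval from a random start."""
    rng = random.Random(seed)
    ordered = sorted(units, key=key)
    total = len(ordered)
    step = total / n                 # sampling interval (may be fractional)
    start = rng.random() * step      # random start in [0, step)
    picks = []
    for i in range(n):
        index = min(int(start + i * step), total - 1)  # guard against float rounding
        picks.append(ordered[index])
    return picks

# Hypothetical frame: 100 polling places with a made-up prior vote share.
frame_rng = random.Random(42)
polling_places = [{"id": i, "prior_share": frame_rng.random()} for i in range(100)]

sample = sorted_systematic_sample(polling_places, key=lambda u: u["prior_share"], n=10)
```

Because the frame is sorted before selection, the sample is spread across the whole range of the sorting variable, which is the usual motivation for this design.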

Probability Sampling Method for a Hidden Population Using Respondent-Driven Sampling: Simulation for Cancer Survivors

  • Jung, Minsoo
    • Asian Pacific Journal of Cancer Prevention / Vol. 16, No. 11 / pp. 4677-4683 / 2015
  • When no sampling frame exists for a certain group, or the group is concerned that making its population public would bring social stigma, we say the population is hidden. Such a population is difficult to approach from a survey-methodological standpoint because, when probability sampling is used, the response rate is low and its members are not entirely honest in their responses. The only known alternative that addresses the problems of earlier methods such as snowball sampling is respondent-driven sampling (RDS), which was developed by Heckathorn and his colleagues. RDS is based on a Markov chain and uses the social network information of the respondents. This characteristic allows for probability sampling when surveying a hidden population. We verified through computer simulation whether RDS can be used on a hidden population of cancer survivors. According to the simulation results of this study, the bias of chain-referral sampling in RDS tends to shrink as the sample gets bigger, and the sample stabilizes as the waves progress. The results therefore show that the final sample information can be completely independent of the initial seeds if a certain sample size is secured, even when the initial seeds were selected through convenience sampling. Thus, RDS can be considered an alternative that improves upon both key informant sampling and ethnographic surveys, and it should be applied to a variety of cases domestically as well.
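The claim that the final sample becomes independent of the initial seeds follows from the Markov chain view of recruitment. A minimal sketch, assuming a hypothetical two-group population and made-up recruitment probabilities (not the paper's simulation):

```python
def wave_composition(seed_dist, transition, waves):
    """Propagate the group composition of recruits through successive recruitment waves."""
    dist = list(seed_dist)
    size = len(dist)
    for _ in range(waves):
        dist = [sum(dist[i] * transition[i][j] for i in range(size)) for j in range(size)]
    return dist

# Hypothetical recruitment matrix: row i gives the probability that a recruiter
# in group i recruits someone from group 0 or group 1.
T = [[0.7, 0.3],
     [0.4, 0.6]]

from_group_0 = wave_composition([1.0, 0.0], T, waves=10)  # all seeds from group 0
from_group_1 = wave_composition([0.0, 1.0], T, waves=10)  # all seeds from group 1
```

After ten waves the two compositions are nearly identical and both lie close to the chain's equilibrium (4/7, 3/7), regardless of which group the seeds came from.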

Gy의 입자성 물질 시료채취이론에 근거한 토양 시료 채취량 결정 (Determination of Soil Sample Size Based on Gy's Particulate Sampling Theory)

  • 배범한
    • 한국지하수토양환경학회지:지하수토양환경 / Vol. 16, No. 6 / pp. 1-9 / 2011
  • A bibliographical review of Gy's sampling theory for particulate materials was conducted to give readers useful means of reducing errors in soil contamination investigations. According to the Gy theory, the errors caused by the heterogeneous nature of soil include the fundamental error (FE), caused by physical and chemical constitutional heterogeneity; the grouping and segregation error (GE), arising from gravitational force; the long-range heterogeneity fluctuation error ($CE_2$); the periodic heterogeneity fluctuation error ($CE_3$); and the materialization error (ME), generated during the physical process of sample treatment. However, $CE_2$ and $CE_3$ cannot easily be estimated accurately, and only increasing the number of sampling locations can reduce their magnitude. In addition, incremental sampling is the only method that reduces GE, while grab sampling should be avoided because it introduces uncertainty and errors into the sampling process. Correct preparation and operation of sampling tools are important factors in reducing the incremental delimitation error (DE) and extraction error (EE), which result from physical processes in sampling. Therefore, Gy sampling theory can be used efficiently in planning a strategy for soil investigations of non-volatile and non-reactive samples.

전화조사를 위한 시간균형할당표본추출 (Time-Balanced Quota Sampling for Telephone Survey)

  • 허명회;황진모
    • 한국조사연구학회지:조사연구 / Vol. 7, No. 2 / pp. 39-52 / 2006
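  • Most survey research organizations in Korea conduct telephone surveys using quota sampling by region, sex, and age group. On weekdays, however, at-home rates differ substantially across individuals by sociodemographic characteristics, which raises concern about systematic respondent selection bias. To address this problem, we propose 'time-balanced quota sampling', which adds the survey time slot as a quota variable, and 'time-balanced quasi-quota sampling', which partially relaxes the quota for the evening time slot. Using the raw data of the Time Use Survey collected by Statistics Korea in 2004 as a hypothetical population, we then compare Monte Carlo samples obtained with the new quota sampling methods and with the existing quota sampling method.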


Comparative Assessment of a Self-sampling Device and Gynecologist Sampling for Cytology and HPV DNA Detection in a Rural and Low Resource Setting: Malaysian Experience

  • Latiff, Latiffah A;Ibrahim, Zaidah;Pei, Chong Pei;Rahman, Sabariah Abdul;Akhtari-Zavare, Mehrnoosh
    • Asian Pacific Journal of Cancer Prevention / Vol. 16, No. 18 / pp. 8495-8501 / 2016
  • Purpose: This study was conducted to assess the agreement and differences between cervical self-sampling with a Kato device (KSSD) and gynecologist sampling for Pap cytology and human papillomavirus DNA (HPV DNA) detection. Materials and Methods: Women underwent self-sampling followed by gynecologist sampling during screening at two primary health clinics. Pap cytology of cervical specimens was evaluated for specimen adequacy, presence of endocervical cells or transformation zone cells, and cytological interpretation of cell abnormalities. Cervical specimens were also extracted and tested for HPV DNA. Positive HPV smears underwent gene sequencing and HPV genotyping by reference to the online NCBI gene bank. Results were compared between sampling methods by Kappa agreement and the McNemar test. Results: For Pap specimen adequacy, KSSD showed 100% agreement with gynecologist sampling but only 32.3% agreement for presence of endocervical cells. Both sampling methods showed 100% agreement on the cytology result, with only 1 case of HSIL favouring CIN2 detected. HPV DNA detection showed 86.2% agreement (K=0.64, 95% CI 0.524-0.756, p=0.001) between sampling methods. KSSD and gynecologist sampling identified high-risk HPV in 17.3% and 23.9% of cases, respectively (p=0.014). Conclusion: Self-sampling with the Kato device can serve as a tool for Pap cytology and HPV DNA detection in low-resource settings in Malaysia. Self-sampling devices such as KSSD can be used as an alternative to gynecologist sampling for cervical cancer screening among rural populations in Malaysia.
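The two agreement statistics named in this abstract can be computed from a 2x2 table of paired results. The sketch below uses the standard Cohen's kappa formula and the McNemar chi-square statistic without continuity correction; the cell counts are hypothetical and are not the study's data.

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for paired binary results.

    a = both methods positive, b = method 1 positive only,
    c = method 2 positive only, d = both methods negative.
    """
    n = a + b + c + d
    p_observed = (a + d) / n
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

def mcnemar_chi2(b, c):
    """McNemar chi-square statistic (no continuity correction) from the discordant pairs."""
    return (b - c) ** 2 / (b + c)

# Hypothetical counts for 100 women sampled by both methods.
kappa = cohen_kappa(a=40, b=10, c=5, d=45)
chi2 = mcnemar_chi2(b=10, c=5)
```

Kappa measures agreement beyond chance across all four cells, while the McNemar statistic looks only at the discordant pairs to test whether one method yields positives more often than the other.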

A Comparison of Systematic Sampling Designs for Forest Inventory

  • Yim, Jong Su;Kleinn, Christoph;Kim, Sung Ho;Jeong, Jin-Hyun;Shin, Man Yong
    • 한국산림과학회지 / Vol. 98, No. 2 / pp. 133-141 / 2009
  • This study was conducted to support the choice of a statistically efficient sampling design for forest resources assessments in South Korea. To this end, different systematic sampling designs were simulated and compared on an artificial forest population built from field sample data and satellite data in Yang-Pyeong County, Korea. Using the k-NN technique, two thematic maps (growing stock and forest cover type per pixel unit) across the test area were generated; field data (n=191) and Landsat ETM+ were used as source data. Four sampling designs (systematic sampling, systematic sampling for post-stratification, systematic cluster sampling, and stratified systematic sampling) were employed as optimum sampling design candidates. To compute the error variance, Monte Carlo simulation was used (k=1,000). Sampling error and relative efficiency were then compared. When the objective of an inventory was to obtain estimates for the entire population, systematic cluster sampling was superior to the other sampling designs. When the objective was to obtain estimates for each sub-population, post-stratification gave better estimates. Performing this procedure successfully requires clear definitions of the strata of interest per field observation unit for efficient stratification.

A Sampling Inspection Plan with Human Error: Considering the Relationship between Visual Inspection Time and Human Error Rate

  • Lee, Yong-Hwa;Hong, Seung-Kweon
    • 대한인간공학회지 / Vol. 30, No. 5 / pp. 645-650 / 2011
  • Objective: The aim of this study is to design a sampling inspection plan that accounts for human error that changes with inspection time. Background: Typical sampling inspection plans have been established on the assumption of perfect inspection without human error. However, most inspection tasks involve human error in the inspection process. Therefore, a sampling inspection plan should be designed with imperfect inspection in mind. Method: A model for single sampling inspection plans was proposed for cases in which the visual inspection error rate changes with inspection time. Additionally, a sampling inspection plan for an optimal inspection time was proposed. To show an applied example of the proposed model, a visual inspection experiment was performed and the inspection error rates were measured as a function of inspection time. Results: Inspection error rates changed with inspection time. The inspection error rate could be reflected in single sampling inspection plans for attributes. In particular, the inspection error rate at an optimal inspection time may be used to obtain a reasonable single sampling plan in practice. Conclusion: The human error rate in inspection tasks should be reflected in typical single sampling inspection plans. A sampling inspection plan that considers human error requires a larger sample size than a typical sampling plan that assumes perfect inspection. Application: The results of this research may help to determine a more practical sampling inspection plan than the typical one.
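One common way to fold inspection error into a single sampling plan for attributes is to replace the true fraction defective with the apparent fraction defective seen by an imperfect inspector. The sketch below assumes that textbook formulation with fixed error rates `e1` (good item rejected) and `e2` (defective item accepted); it is not the paper's time-dependent model, and all names and numbers are hypothetical.

```python
from math import comb

def apparent_defect_rate(p, e1, e2):
    """Fraction of items an imperfect inspector classifies as defective.

    p  = true fraction defective
    e1 = probability a good item is classified as defective
    e2 = probability a defective item is classified as good
    """
    return p * (1 - e2) + (1 - p) * e1

def acceptance_probability(n, c, p):
    """Probability of accepting a lot: at most c observed defectives in a sample of n."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(c + 1))

# Hypothetical plan (n=50, c=2) and lot quality (5% defective).
pa_perfect = acceptance_probability(50, 2, 0.05)
pa_with_error = acceptance_probability(50, 2, apparent_defect_rate(0.05, e1=0.02, e2=0.10))
```

With these made-up error rates the apparent defect rate exceeds the true rate, so the same plan accepts the lot less often than it would under perfect inspection.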

파랑자료의 sampling rate가 극한파의 통계에 미치는 영향 (The Effect of Sampling Rate on Statistical Properties of Extreme Wave)

  • 김도영
    • 한국해양환경ㆍ에너지학회지 / Vol. 16, No. 1 / pp. 36-41 / 2013
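  • In this paper, wave time series were simulated to examine how the sampling rate used in wave measurement affects various statistical properties of wave data. To identify the effect of the sampling rate on the statistical properties of extreme waves such as freak waves, changes in the abnormality index (AI), the kurtosis of the wave profile, and the maximum wave height were examined. As the sampling rate increases, the various wave heights tend to decrease. As the sampling rate increases, the zeroth moment of the wave spectrum changes little, but the second moment is strongly affected, so that Tz is overestimated and the bandwidth is underestimated. Consequently, the error in significant wave height caused by changes in the sampling rate is smaller for the spectral significant wave height $H_s$ than for the significant wave height $H_{1/3}$ obtained by individual wave analysis. The error caused by the sampling rate tends to decrease as the wave period increases. The kurtosis of the wave profile and the AI do not show large errors when the sampling rate is 1 Hz or higher. In general, when measuring waves that include extreme waves such as freak waves, the error that the sampling rate introduces into the maximum wave height is expected to be 5% or less if ocean wave data measured at a sampling rate of 1 Hz or higher are used.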
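The basic mechanism behind these findings, that a coarser record can miss wave crests, can be illustrated with a small deterministic sketch. The three cosine components and the 8 Hz reference rate are assumptions chosen for illustration; this is not the paper's simulation and does not reproduce its figures.

```python
import math

def surface_elevation(duration_s, rate_hz, components):
    """Sum of cosine components (amplitude in m, frequency in Hz, phase in rad) sampled at rate_hz."""
    n = int(duration_s * rate_hz)
    return [
        sum(a * math.cos(2 * math.pi * f * (i / rate_hz) + ph) for a, f, ph in components)
        for i in range(n)
    ]

# Hypothetical wave components with periods of roughly 8 to 12 seconds.
components = [(1.0, 0.08, 0.3), (0.8, 0.10, 1.1), (0.5, 0.13, 2.0)]

fine = surface_elevation(600, 8, components)  # 8 Hz reference record, 10 minutes
one_hz = fine[::8]                            # same record kept at 1 Hz
half_hz = fine[::16]                          # same record kept at 0.5 Hz

max_fine = max(fine)
loss_1hz = (max_fine - max(one_hz)) / max_fine      # relative loss in maximum elevation
loss_half_hz = (max_fine - max(half_hz)) / max_fine
```

Because each coarser record is a subset of the finer one, its maximum elevation can only be equal or lower, so the relative loss grows or stays the same as fewer samples are kept.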

공기중 염화비닐단량체의 포집시 공기 포집량이 파과에 미치는 영향 (Effect of sampling volume on the breakthrough of charcoal tube during vinyl chloride monomer sampling)

  • 윤존중;임남구;김치년;노재훈
    • 한국산업보건학회지 / Vol. 11, No. 3 / pp. 241-248 / 2001
  • The main factors affecting breakthrough are known to be sampling time, flow rate, concentration of the sample, temperature, humidity, and the physical characteristics of the solid sorbent tube. However, no study has reported the effect of temperature and sampling volume on the breakthrough of a charcoal tube during vinyl chloride monomer (VCM) sampling. The objective of this study is to suggest the optimal sampling conditions for VCM sampling based on the National Institute for Occupational Safety and Health (NIOSH) method. To evaluate the adequate sampling volume for VCM without breakthrough, volumes of 1, 2, 3, 4, and 5 L of VCM at 1, 5, 10, 15, and 20 ppm were sampled at a flow rate of 0.05 L/min at $22^{\circ}C$ and $40^{\circ}C$. At $22^{\circ}C$, for 1, 5, 10, and 15 ppm, VCM was adsorbed completely in the first section of the charcoal tube regardless of sampling volume. At 20 ppm, however, the detection rates were 99.56% in the first section and 0.44% in the second section. At $40^{\circ}C$ and 1 ppm, VCM was adsorbed completely in the first section. At 10, 15, and 20 ppm, the detection rates in the second, third, and fourth sections decreased significantly as the sampling volume was reduced. In the determination of breakthrough based on the NIOSH method, no breakthrough occurred at 20 ppm at $22^{\circ}C$. At $40^{\circ}C$, breakthrough occurred at 10, 15, and 20 ppm when the sampling volume was 5 L, whereas no breakthrough occurred when the sampling volume was 3 L. In conclusion, at temperatures around $22^{\circ}C$, breakthrough may not occur up to 20 ppm during VCM sampling. During VCM sampling at temperatures around $40^{\circ}C$, no breakthrough occurred at 1-5 ppm and 10-20 ppm when the sampling volume was 5 L and 3 L, respectively. This result suggests that the sampling volume should be considered when sampling VCM under hot conditions (> $22^{\circ}C$) by NIOSH Method No. 1007.
