• Title/Abstract/Keywords: data sampling


Classification of Class-Imbalanced Data: Effect of Over-sampling and Under-sampling of Training Data

  • 김지현;정종빈
    • Korean Journal of Applied Statistics
    • /
    • Vol. 17, No. 3
    • /
    • pp.445-457
    • /
    • 2004
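  • In two-class classification problems where the class sizes are severely imbalanced, analysts often artificially make the two classes similar in size before analysis. This study examines the validity of such training sample construction methods. It also examines the effect of the training sample construction method on boosting. Experiments on 12 real data sets led to the conclusion that, when boosting is applied with tree models, it is better to analyze the training sample as it is.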
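
The two training sample constructions named in the title, over-sampling and under-sampling, can be illustrated with a minimal sketch. The function name `rebalance`, the toy data, and the choice to equalize the classes exactly are assumptions made for illustration; the abstract does not describe the authors' actual procedure.

```python
import random
from collections import Counter

def rebalance(rows, labels, mode, seed=0):
    """Return (rows, labels) with the two classes equalized (hypothetical helper).

    mode="over":  duplicate minority rows by sampling with replacement
    mode="under": drop majority rows by sampling without replacement
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    # most_common sorts by count, so the first entry is the majority class
    (major, n_major), (minor, n_minor) = counts.most_common(2)
    by_class = {c: [i for i, y in enumerate(labels) if y == c] for c in counts}
    if mode == "over":
        extra = rng.choices(by_class[minor], k=n_major - n_minor)
        keep = by_class[major] + by_class[minor] + extra
    elif mode == "under":
        keep = rng.sample(by_class[major], n_minor) + by_class[minor]
    else:
        raise ValueError("mode must be 'over' or 'under'")
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data: 10 positives against 90 negatives
rows = list(range(100))
labels = [1] * 10 + [0] * 90
_, y_over = rebalance(rows, labels, "over")
_, y_under = rebalance(rows, labels, "under")
print(Counter(y_over), Counter(y_under))
```

Over-sampling keeps every original row but repeats minority rows, while under-sampling discards majority rows; the abstract's conclusion is that for boosted trees neither step was preferable to leaving the training sample unchanged.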

Variance estimation for distribution rate in stratified cluster sampling with missing values

  • Heo, Sunyeong
    • Journal of the Korean Data and Information Science Society
    • /
    • Vol. 28, No. 2
    • /
    • pp.443-449
    • /
    • 2017
  • Population proportions, such as the distribution rate of LED TVs or the prevalence of a disease, are often estimated from survey sample data. A population proportion is generally regarded as a special case of a population mean. In complex designs such as stratified multistage sampling with unequal selection probabilities, the denominator of the mean may be a random variable, in which case the mean is estimated as a ratio estimator. In this research, we examined the estimation of the distribution rate under stratified multistage sampling and computed numerical results using stratified random sample data with about 25% missing observations. In the data used for this research, the survey weights were determined deterministically. The weights are therefore not random variables, and the population distribution rate and its variance can be estimated in the same way as a population mean. When the weights are not random variables, estimating the variance of the proportion estimator with the ratio method may inflate the variance. Therefore, before choosing a variance estimation method for a population proportion, one needs to examine the structure of the data and the survey design.

Adjusting sampling bias in case-control genetic association studies

  • Seo, Geum Chu;Park, Taesung
    • Journal of the Korean Data and Information Science Society
    • /
    • Vol. 25, No. 5
    • /
    • pp.1127-1135
    • /
    • 2014
  • Genome-wide association studies (GWAS) are designed to discover genetic variants, such as single nucleotide polymorphisms (SNPs), that are associated with complex human traits. Although there is increasing interest in applying GWAS methodologies to population-based cohorts, many published GWAS have adopted a case-control design, which raises the issue of sampling bias in both the case and the control samples. Because cases and controls have unequal selection probabilities, the samples are not representative of the population they are purported to represent. Non-random sampling in a case-control study can therefore lead to inconsistent and biased estimates of SNP-trait associations. In this paper, we propose inverse-probability-of-sampling weights based on disease prevalence to eliminate case-control sampling bias in estimating and testing associations between SNPs and quantitative traits. We apply the proposed method to data from the Korea Association Resource project and show that the standard estimators applied to the weighted data yield unbiased estimates.
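
As a rough illustration of weighting by disease prevalence, the sketch below gives each case the weight prevalence / (sample share of cases) and each control the weight (1 - prevalence) / (sample share of controls), then takes a weighted mean of a quantitative trait. This is a textbook form of inverse-probability weighting chosen as an assumption; the abstract does not give the authors' exact weights, and the function names, trait values, and the 10% prevalence are hypothetical.

```python
def ipw_weights(is_case, prevalence):
    """Weights that rescale cases and controls to their population shares.

    Assumed form: w = (population share of group) / (sample share of group).
    """
    n = len(is_case)
    n_case = sum(is_case)
    n_ctrl = n - n_case
    w_case = prevalence / (n_case / n)
    w_ctrl = (1 - prevalence) / (n_ctrl / n)
    return [w_case if c else w_ctrl for c in is_case]

def weighted_mean(values, weights):
    """Weighted mean, normalized by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Toy case-control sample: half cases, although the assumed prevalence is 10%
is_case = [True] * 4 + [False] * 4
trait = [10.0, 12.0, 11.0, 11.0, 5.0, 7.0, 6.0, 6.0]

w = ipw_weights(is_case, prevalence=0.1)
naive = sum(trait) / len(trait)      # treats the sample as if it were random
weighted = weighted_mean(trait, w)   # roughly 0.1 * 11 + 0.9 * 6
print(naive, weighted)
```

The unweighted mean is pulled toward the case group because cases are over-represented in the sample; the weighted mean restores the assumed population mix.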

Application of Sampling Theories to Data from Bottom Trawl Surveys Along the Korean Coastal Areas for Inferring the Relative Size of a Fish Population

  • 이효태;현상윤
    • Korean Journal of Fisheries and Aquatic Sciences
    • /
    • Vol. 50, No. 5
    • /
    • pp.594-604
    • /
    • 2017
  • The Korean National Institute of Fisheries Science (NIFS) has conducted a bottom trawl survey along the coastal areas twice a year (in spring and fall) for the last decade, taking samples on a regular basis (i.e., systematic sampling). Despite the availability of the survey data, NIFS has neither officially reported estimates of groundfish population sizes nor evaluated the uncertainty of such estimates. The objectives of our study were to infer the relative size of a fish population by applying two sampling techniques (simple and stratified sampling) with different observation units to the NIFS survey data, and to compare the two techniques in terms of bias and precision. For demonstration purposes, we used data on Pacific cod (Gadus macrocephalus) collected in the 2011-2015 surveys. The results of simple and stratified sampling showed that the point estimates and their precision varied with both the observation unit and the sampling technique.
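
The contrast between simple and stratified estimation can be sketched with the standard textbook estimators of a mean and its variance. The catch values, the two depth strata, and the equal stratum weights are invented for illustration, and the finite population correction is ignored; none of this is taken from the NIFS data or the authors' code.

```python
from statistics import mean, variance

def simple_estimate(catches):
    """Simple random sampling: sample mean and its estimated variance s^2 / n."""
    return mean(catches), variance(catches) / len(catches)

def stratified_estimate(strata):
    """Stratified sampling with strata given as (weight W_h, catches in stratum h).

    The weights are assumed to sum to 1. Returns the weighted mean
    sum(W_h * mean_h) and its variance sum(W_h^2 * s_h^2 / n_h).
    """
    est = sum(w * mean(ys) for w, ys in strata)
    var = sum(w ** 2 * variance(ys) / len(ys) for w, ys in strata)
    return est, var

# Hypothetical catch per tow in two depth strata of equal area
shallow = [2.0, 4.0, 3.0, 3.0]
deep = [20.0, 22.0, 21.0, 21.0]

simple = simple_estimate(shallow + deep)
strat = stratified_estimate([(0.5, shallow), (0.5, deep)])
print(simple, strat)
```

With these toy numbers both estimators give the same point estimate, but the stratified variance is far smaller because most of the variability lies between strata rather than within them.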

Design of Audio Sampling Rate Conversion Block

  • 정혜진;심윤정;이승준
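    • Institute of Electronics Engineers of Korea: Conference Proceedings
    • /
    • Proceedings of the Institute of Electronics Engineers of Korea 2003 Summer Conference, Vol. II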
    • /
    • pp.827-830
    • /
    • 2003
  • This paper proposes an area-efficient FIR filter architecture for sampling rate conversion of hi-fi audio data. The sampling rate conversion (SRC) block converts audio data sampled at 96 kHz down to 48 kHz and vice versa. Coefficients for a 63-tap FIR filter were synthesized to give 100 dB of stopband attenuation and a 5.2 kHz transition bandwidth. The time-shared filter architecture requires only one multiplier and one accumulator for the 63-tap filter operation. This yields a substantial hardware saving, with an area 10 to 19 times smaller than that of a traditional FIR structure.


SAMPLING ERROR ANALYSIS FOR SOIL MOISTURE ESTIMATION

  • Kim, Gwang-Seob;Yoo, Chul-sang
    • Water Engineering Research
    • /
    • Vol. 1, No. 3
    • /
    • pp.209-222
    • /
    • 2000
  • A spectral formalism was applied to quantify the sampling errors due to spatial and/or temporal gaps in soil moisture measurements. The lack of temporal measurements of the two-dimensional soil moisture field makes it difficult to compute the spectra directly from observed records. Therefore, space-time soil moisture spectra derived from stochastic models of rainfall and soil moisture were used instead. Parameters for both models were tuned with Southern Great Plains Hydrology Experiment (SGP'97) data and Oklahoma Mesonet data. Soil moisture data are discrete in space and time, so a design filter was developed to compute the sampling errors for discrete measurements in space and time. The advantage of this filter is its general form, which applies to all kinds of sampling designs. Sampling errors of the soil moisture estimates during the SGP'97 Hydrology Experiment period were estimated. The sampling errors for various sampling designs, such as satellite overpass and point measurement with ground probes, were estimated under the climate conditions between June and August 1997 and the soil properties of the SGP'97 experimental area. The ground truth design was evaluated for spatial gaps of 25 km and 50 km and temporal gaps from zero to 5 days.


Study on the Effect of Training Data Sampling Strategy on the Accuracy of the Landslide Susceptibility Analysis Using Random Forest Method

  • 강경희;박혁진
    • Economic and Environmental Geology
    • /
    • Vol. 52, No. 2
    • /
    • pp.199-212
    • /
    • 2019
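  • In analyses using machine learning techniques, the sampling strategy for training data strongly affects not only prediction accuracy but also generalization ability. In landslide susceptibility analysis in particular, data imbalance arises because information on non-landslide areas far outweighs information on landslide areas, so data sampling is essential when designing the training data for an analysis model. However, most previous studies merely selected non-landslide data at random at a 1:1 ratio with landslide data, without applying any specific selection criterion. This study therefore built six different scenarios according to the sampling strategy criteria for landslide and non-landslide areas and used them to train a Random Forest model, in order to determine how the training data sampling strategy affects the model's predictive performance. In addition, the variable importance produced by Random Forest was multiplied as a weight with each landslide-causing factor to compute a landslide susceptibility index; the index values were used to produce landslide susceptibility maps, and the accuracy of each resulting map was compared. The analysis showed that, regardless of the training data sampling method, the landslide susceptibility results for both study areas had accuracies of 70~80%. This confirms the applicability of Random Forest as a landslide susceptibility analysis technique and shows that the input variable importance provided by the Random Forest model can be used as weights for landslide-causing factors. A comparison of accuracy across the training scenarios also confirmed that designing the training data according to specific criteria can be expected to yield higher prediction accuracy than the conventional random selection method.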

Simulated Annealing for Overcoming Data Imbalance in Mold Injection Process

  • 이동주
    • Journal of the Society of Korea Industrial and Systems Engineering
    • /
    • Vol. 45, No. 4
    • /
    • pp.233-239
    • /
    • 2022
  • Injection molding is a process in which thermoplastic resin is heated to a fluid state, injected under pressure into the cavity of a mold, and then cooled in the mold to produce a product with the same shape as the cavity. It enables mass production and complex shapes, and various factors such as resin temperature, mold temperature, injection speed, and pressure affect product quality. Data collected at the manufacturing site contain many records of good products but few records of defective products, resulting in serious data imbalance. To address this imbalance efficiently, undersampling, oversampling, and composite sampling are usually applied. In this study, we apply oversampling techniques such as random oversampling (ROS), the synthetic minority oversampling technique (SMOTE), and adaptive synthetic sampling (ADASYN), which amplify the minority-class data relative to the majority class, as well as composite sampling, which uses both undersampling and oversampling. For composite sampling, SMOTE+ENN and SMOTE+Tomek were used. Artificial neural network techniques are used to predict product quality. Specifically, MLP and RNN are applied, and both require optimization of various parameters. In this study, we propose a simulated annealing (SA) technique that optimizes the choice of sampling method, the minority-class ratio for the sampling method, and, as MLP and RNN parameters, the batch size and the number of hidden layer units. The existing sampling methods and the proposed SA method were compared using accuracy, precision, recall, and F1 score to demonstrate the superiority of the proposed method.
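
Of the oversampling techniques listed, SMOTE is the one whose core idea is easiest to show in a few lines: new minority points are interpolated between an existing minority point and one of its nearest minority neighbours. The sketch below is a simplified, SMOTE-style illustration in plain Python, not the authors' implementation and not a library API; the function name `smote_like`, the parameter `k`, and the sample points are assumptions.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Simplified SMOTE-style sketch: requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic

# Hypothetical minority-class points (e.g. two scaled process features)
minority = [(1.0, 1.0), (1.5, 1.2), (0.8, 1.4), (1.2, 0.9)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority points, the new data stay inside the region the minority class already occupies, which is the main difference from random oversampling's exact duplicates.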

Variations of Estimated Pollutant Loading from Rural Streams with Sampling Intervals

  • 강문성;박승우;윤광식
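    • Korean Society of Agricultural Engineers: Conference Proceedings
    • /
    • Proceedings of the Korean Society of Agricultural Engineers 1998 Conference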
    • /
    • pp.552-557
    • /
    • 1998
  • The sampling schemes are intended for situations where stream-flow data are collected regularly but concentration data are collected during only a limited number of time periods. A method for estimating water pollutant loading that accounts for sampling intervals is presented and, for illustrative purposes, applied to sampling station HS#3 in the Balan-reservoir watershed, located southwest of Suwon. Stratification is applied uniformly across all sampling strategies: the strata boundaries are defined using the actual distribution of flow values and selected nonexceedance probabilities to minimize inaccuracy. Ratio estimators for SS, T-N, and T-P were used to calculate the water pollutant loading. A sampling scheme that combines stratified sampling with real-time sampling characteristics is found to give an appropriate estimate of the mass load.


A Cost Effective Reference Data Sampling Algorithm Using Fractal Analysis

  • Lee, Byoung-Kil;Eo, Yang-Dam;Jeong, Jae-Joon;Kim, Yong-Il
    • ETRI Journal
    • /
    • Vol. 23, No. 3
    • /
    • pp.129-137
    • /
    • 2001
  • Random sampling or systematic sampling is commonly used to assess the accuracy of classification results. In remote sensing, these sampling methods require much time and tedious work to acquire sufficient ground truth data, so a more effective sampling method that can represent the characteristics of the population is needed. In this study, fractal analysis is adopted as an index for reference sampling. The fractal dimensions of the whole study area and of the sub-regions are calculated to select the sub-regions whose dimensionality is most similar to that of the whole area. The classification accuracy of the whole area is then compared with those of the sub-regions, and it is verified that the accuracies of the selected sub-regions are similar to that of the whole area. A new reference sampling method using this procedure is proposed. The results show that it is possible to reduce the sampling area and sample size while keeping the same level of accuracy as the existing methods.
