• Title/Summary/Keyword: data sampling


Classification of Class-Imbalanced Data: Effect of Over-sampling and Under-sampling of Training Data (계급불균형자료의 분류: 훈련표본 구성방법에 따른 효과)

  • 김지현;정종빈
    • The Korean Journal of Applied Statistics
    • /
    • v.17 no.3
    • /
    • pp.445-457
    • /
    • 2004
  • Given class-imbalanced data in a two-class classification problem, the training data are often over-sampled and/or under-sampled to balance the classes. We investigate the validity of this practice, and also study its effect on boosting of classification trees. Experiments on twelve real datasets show that keeping the natural distribution of the training data is the best choice when boosting methods are to be applied to class-imbalanced data.
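The over- and under-sampling practice examined in this paper can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and interface are assumptions for the example.

```python
import numpy as np

def rebalance(X, y, method="under", rng=None):
    """Randomly over- or under-sample a binary-labelled training set
    so both classes end up the same size."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    majority = classes[np.argmax(counts)]
    idx_min = np.flatnonzero(y == minority)
    idx_maj = np.flatnonzero(y == majority)
    if method == "under":
        # shrink the majority class down to the minority size
        idx_maj = rng.choice(idx_maj, size=idx_min.size, replace=False)
    else:
        # "over": replicate minority samples up to the majority size
        idx_min = rng.choice(idx_min, size=idx_maj.size, replace=True)
    idx = np.concatenate([idx_min, idx_maj])
    return X[idx], y[idx]
```

The paper's finding is that, for boosting, neither branch should be taken: the natural class distribution should be kept.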

Variance estimation for distribution rate in stratified cluster sampling with missing values

  • Heo, Sunyeong
    • Journal of the Korean Data and Information Science Society
    • /
    • v.28 no.2
    • /
    • pp.443-449
    • /
    • 2017
  • Population proportions, such as the penetration rate of LED TVs or the prevalence of a disease, are often estimated from survey sample data. A population proportion is generally treated as a special form of the population mean. In complex designs such as stratified multistage sampling with unequal selection probabilities, the denominator of the mean may be a random variable, and the proportion is then estimated like a ratio estimator. In this study, we examined the estimation of a distribution rate under stratified multistage sampling and obtained numerical results using stratified random sample data with about 25% missing observations. In these data, the survey weights were determined deterministically; since the weights are not random variables, the population distribution rate and its variance can be estimated as in ordinary population mean estimation. When the weights are not random variables, estimating the variance of the proportion estimator by the ratio method may inflate the variance. Therefore, when estimating the variance of a population proportion, one should examine the data structure and survey design before choosing an estimation method.

Adjusting sampling bias in case-control genetic association studies

  • Seo, Geum Chu;Park, Taesung
    • Journal of the Korean Data and Information Science Society
    • /
    • v.25 no.5
    • /
    • pp.1127-1135
    • /
    • 2014
  • Genome-wide association studies (GWAS) are designed to discover genetic variants, such as single nucleotide polymorphisms (SNPs), that are associated with complex human traits. Although there is increasing interest in applying GWAS methodology to population-based cohorts, many published GWAS have adopted a case-control design, which raises the issue of sampling bias in both the case and control samples. Because of unequal selection probabilities between cases and controls, the samples are not representative of the population they are purported to represent; non-random sampling in a case-control study can therefore lead to inconsistent and biased estimates of SNP-trait associations. In this paper, we propose inverse-probability-of-sampling weights based on disease prevalence to eliminate case-control sampling bias when estimating and testing associations between SNPs and quantitative traits. We apply the proposed method to data from the Korea Association Resource project and show that standard estimators applied to the weighted data yield unbiased estimates.
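The weighting idea can be sketched as follows: cases and controls are reweighted by the inverse of their sampling probability so that the weighted sample mirrors a population with the stated disease prevalence. A minimal sketch, not the authors' implementation; the function names and the simple weighted-slope estimator are illustrative assumptions.

```python
import numpy as np

def ipw_weights(is_case, prevalence):
    """Inverse-probability-of-sampling weights for a case-control sample:
    after weighting, the case fraction equals the population prevalence."""
    is_case = np.asarray(is_case, dtype=bool)
    f_case = is_case.mean()  # case fraction in the sample
    return np.where(is_case,
                    prevalence / f_case,              # weight for cases
                    (1 - prevalence) / (1 - f_case))  # weight for controls

def weighted_slope(genotype, trait, w):
    """Weighted least-squares slope of a quantitative trait on a single
    SNP genotype (0/1/2 allele count)."""
    genotype = np.asarray(genotype, float)
    trait = np.asarray(trait, float)
    gbar = np.average(genotype, weights=w)
    tbar = np.average(trait, weights=w)
    num = np.sum(w * (genotype - gbar) * (trait - tbar))
    den = np.sum(w * (genotype - gbar) ** 2)
    return num / den
```

For a 50/50 case-control sample and a prevalence of 10%, cases receive weight 0.2 and controls 1.8, so the weighted case fraction is exactly 0.1.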

Application of Sampling Theories to Data from Bottom Trawl Surveys Along the Korean Coastal Areas for Inferring the Relative Size of a Fish Population (한반도 연근해 저층 트롤 조사 자료에 표본론을 적용한 개체군의 상대적 크기 추정)

  • Lee, Hyotae;Hyun, Saang-Yoon
    • Korean Journal of Fisheries and Aquatic Sciences
    • /
    • v.50 no.5
    • /
    • pp.594-604
    • /
    • 2017
  • The Korean National Institute of Fisheries Science (NIFS) has conducted a biannual (spring and fall) bottom trawl survey along the Korean coastal areas for the last decade, taking samples on a regular basis (i.e., systematic sampling). Despite the availability of these survey data, NIFS has not yet officially reported estimates of groundfish population sizes, nor has it evaluated the uncertainty of such estimates. The objectives of our study were to infer the relative size of a fish population by applying two different sampling techniques (simple and stratified sampling) with different observation units to the NIFS survey data, and to compare the two techniques in terms of bias and precision. For demonstration purposes, we used data on Pacific cod (Gadus macrocephalus) collected by the 2011-2015 surveys; under both simple and stratified sampling, the point estimates and their precision varied with the observation unit as well as with the sampling technique.
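The two designs compared in this paper differ in how the mean catch per observation unit is estimated. A textbook sketch of both estimators, assuming known stratum area shares; the details of the authors' estimators may differ.

```python
import numpy as np

def simple_mean(y):
    """Simple random sampling: sample mean and its estimated variance."""
    y = np.asarray(y, float)
    return y.mean(), y.var(ddof=1) / len(y)

def stratified_mean(strata, weights):
    """Stratified sampling: weight each stratum mean by its area share W_h.
    `strata` is a list of per-stratum catch arrays; `weights` sum to 1."""
    est = sum(W * np.mean(y) for W, y in zip(weights, strata))
    var = sum(W ** 2 * np.var(y, ddof=1) / len(y)
              for W, y in zip(weights, strata))
    return est, var
```

When catch rates are homogeneous within strata but differ between them, the stratified estimator has a much smaller variance than the simple one, which is the kind of precision difference the paper examines.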

Design of Audio Sampling Rate Conversion Block (Audio Sampling Rate Conversion Block의 설계)

  • 정혜진;심윤정;이승준
    • Proceedings of the IEEK Conference
    • /
    • 2003.07b
    • /
    • pp.827-830
    • /
    • 2003
  • This paper proposes an area-efficient FIR filter architecture for sampling rate conversion of hi-fi audio data. The sampling rate conversion (SRC) block converts audio data sampled at 96 kHz down to 48 kHz and vice versa. A 63-tap FIR filter was synthesized whose coefficients give 100 dB stop-band attenuation and a 5.2 kHz transition bandwidth. The time-shared filter architecture requires only one multiplier and one accumulator for the 63-tap filter operation, resulting in hardware savings of 10 to 19 times compared with a traditional FIR structure.
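The signal-processing core of such an SRC block is filter-then-decimate for the down-conversion and zero-stuff-then-filter for the up-conversion. A software sketch of that math (the time-shared hardware computes the same multiply-accumulate sums serially with a single multiplier); the tap vector `h` stands in for the paper's 63 synthesized coefficients.

```python
import numpy as np

def downsample_2to1(x, h):
    """96 kHz -> 48 kHz: low-pass FIR filter, then keep every other sample."""
    y = np.convolve(x, h)[: len(x)]  # FIR filtering
    return y[::2]                    # decimate by 2

def upsample_1to2(x, h):
    """48 kHz -> 96 kHz: insert zeros between samples, then low-pass
    filter; the gain of 2 restores the signal amplitude."""
    z = np.zeros(2 * len(x))
    z[::2] = x
    return 2 * np.convolve(z, h)[: len(z)]
```

With a properly designed 63-tap low-pass (cutoff near 24 kHz), the down-conversion suppresses content that would otherwise alias into the 48 kHz band.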

SAMPLING ERROR ANALYSIS FOR SOIL MOISTURE ESTIMATION

  • Kim, Gwang-Seob;Yoo, Chul-sang
    • Water Engineering Research
    • /
    • v.1 no.3
    • /
    • pp.209-222
    • /
    • 2000
  • A spectral formalism was applied to quantify the sampling errors caused by spatial and/or temporal gaps in soil moisture measurements. The lack of temporal measurements of the two-dimensional soil moisture field makes it difficult to compute the spectra directly from observed records; therefore, space-time soil moisture spectra derived from stochastic models of rainfall and soil moisture were used in their place. Parameters of both models were tuned with data from the Southern Great Plains Hydrology Experiment (SGP'97) and the Oklahoma Mesonet. Because soil moisture data are discrete in space and time, a design filter was developed to compute the sampling errors for discrete measurements; its general form makes it applicable to all kinds of sampling designs. Sampling errors of the soil moisture estimates during the SGP'97 period were computed for various sampling designs, such as satellite overpasses and point-measurement ground probes, under the climate conditions of June-August 1997 and the soil properties of the SGP'97 experimental area. The ground-truth design was evaluated for spatial gaps of 25 km and 50 km and temporal gaps from zero to five days.

Study on the Effect of Training Data Sampling Strategy on the Accuracy of the Landslide Susceptibility Analysis Using Random Forest Method (Random Forest 기법을 이용한 산사태 취약성 평가 시 훈련 데이터 선택이 결과 정확도에 미치는 영향)

  • Kang, Kyoung-Hee;Park, Hyuck-Jin
    • Economic and Environmental Geology
    • /
    • v.52 no.2
    • /
    • pp.199-212
    • /
    • 2019
  • In machine learning, the sampling strategy used to build the training data affects the performance of the prediction model, including both its prediction accuracy and its ability to generalize. In landslide susceptibility analysis in particular, the data sampling procedure is an essential step in constructing the training data, because the number of non-landslide points is much larger than the number of landslide points. Previous studies, however, did not consider different sampling methods for the training data and simply selected it at random. In this study, the authors therefore propose several different sampling methods and assess the effect of the training data sampling strategy on landslide susceptibility analysis. Six scenarios were set up based on the sampling strategies for landslide and non-landslide points. The Random Forest technique was then trained under each of the six scenarios, and the attribute importance of each input variable was evaluated. Landslide susceptibility maps were subsequently produced from the input variables and their attribute importances. The AUC values of the susceptibility maps obtained from the six sampling strategies all showed high prediction rates, ranging from 70% to 80%. This indicates that the Random Forest technique has appropriate predictive performance and that the attribute importances obtained from it can be used as weights for the landslide conditioning factors in the susceptibility analysis. In addition, the results obtained using specific sampling strategies for the training data show higher prediction accuracy than those obtained with the previous random sampling approach.

Simulated Annealing for Overcoming Data Imbalance in Mold Injection Process (사출성형공정에서 데이터의 불균형 해소를 위한 담금질모사)

  • Dongju Lee
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.45 no.4
    • /
    • pp.233-239
    • /
    • 2022
  • Injection molding heats a thermoplastic resin into a fluid state, injects it under pressure into the cavity of a mold, and then cools it in the mold to produce a part identical to the shape of the cavity. The process enables mass production and complex shapes, and factors such as resin temperature, mold temperature, injection speed, and pressure affect product quality. Data collected at the manufacturing site contain many records of good products but few of defective products, resulting in a serious data imbalance. To resolve this imbalance efficiently, undersampling, oversampling, and composite sampling are usually applied. In this study, we apply oversampling techniques that amplify the minority class toward the size of the majority class, such as random oversampling (ROS), SMOTE (Synthetic Minority Oversampling Technique), and ADASYN (Adaptive Synthetic Sampling), as well as composite sampling that combines undersampling and oversampling; for composite sampling, SMOTE+ENN and SMOTE+Tomek were used. Artificial neural networks, specifically MLP and RNN models, are used to predict product quality, and these models require the optimization of several parameters. We propose a simulated annealing (SA) technique that jointly optimizes the choice of sampling method, the minority-class ratio for that method, the batch size, and the number of hidden-layer units of the MLP and RNN. The existing sampling methods and the proposed SA method were compared using accuracy, precision, recall, and F1 score to demonstrate the superiority of the proposed method.
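Unlike ROS, which duplicates minority samples, SMOTE creates new synthetic points by interpolating between a minority sample and one of its nearest minority-class neighbours. A minimal sketch of that core idea; production code would instead use a tuned library implementation (e.g., imbalanced-learn's SMOTE, ADASYN, and combined samplers).

```python
import numpy as np

def smote(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples: each one is a random
    interpolation between a minority point and one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-matches
    nbrs = np.argsort(d, axis=1)[:, :k]  # k nearest neighbours of each point
    out = []
    for _ in range(n_new):
        i = rng.integers(n)                        # pick a minority sample
        j = nbrs[i, rng.integers(min(k, n - 1))]   # and one neighbour
        lam = rng.random()                         # interpolation factor
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)
```

The composite schemes named in the abstract then clean the result: ENN or Tomek-link removal prunes synthetic and majority points that end up on the wrong side of the class boundary.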

Variations of Estimated Pollutant Loading from Rural Streams with Sampling Intervals (채수빈도를 고려한 소하천의 수질오염부하량 특성 연구)

  • 강문성;박승우;윤광식
    • Proceedings of the Korean Society of Agricultural Engineers Conference
    • /
    • 1998.10a
    • /
    • pp.552-557
    • /
    • 1998
  • Sampling schemes are intended for situations where stream-flow data are collected regularly but concentration data are collected during only a limited number of time periods. A method for estimating water pollutant loading that accounts for the sampling interval is presented; for illustration, the criterion is applied to sampling station HS#3 of the Balan-reservoir watershed, located southwest of Suwon. Stratification is employed uniformly for all sampling strategies, with strata boundaries defined from the actual distribution of flow values and selected non-exceedance probabilities so as to minimize inaccuracy. Ratio estimators for SS, T-N, and T-P were used to calculate the pollutant loading. A sampling scheme that combines stratified sampling with real-time flow characteristics is found to give an appropriate estimate of the mass load.
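A common form of the ratio estimator in load studies scales the complete flow record by the flow-weighted mean concentration of the sampled days. This sketch illustrates that general form under stated assumptions; the paper's exact stratified formulation may differ.

```python
import numpy as np

def ratio_load_estimate(conc, flow_sampled, flow_all):
    """Ratio estimator of total pollutant load.

    conc         -- concentrations on the sampled days
    flow_sampled -- flows on those same sampled days (paired with conc)
    flow_all     -- the complete (e.g., daily) flow record for the period
    """
    conc = np.asarray(conc, float)
    flow_sampled = np.asarray(flow_sampled, float)
    # flow-weighted mean concentration from the sampled days
    r = np.sum(conc * flow_sampled) / np.sum(flow_sampled)
    # scale the total flow volume by that ratio
    return r * np.sum(flow_all)
```

Stratifying by flow level, as the paper does, applies this ratio separately within each flow stratum so that rare high-flow events do not dominate the estimate.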

A Cost Effective Reference Data Sampling Algorithm Using Fractal Analysis

  • Lee, Byoung-Kil;Eo, Yang-Dam;Jeong, Jae-Joon;Kim, Yong-Il
    • ETRI Journal
    • /
    • v.23 no.3
    • /
    • pp.129-137
    • /
    • 2001
  • Random or systematic sampling is commonly used to assess the accuracy of classification results. In remote sensing, these sampling methods require much time and tedious work to acquire sufficient ground truth data, so a more effective sampling method that represents the characteristics of the population is needed. In this study, fractal analysis is adopted as an index for reference sampling. The fractal dimensions of the whole study area and its sub-regions are calculated in order to select the sub-regions whose dimensionality is most similar to that of the whole area. The classification accuracy of the whole area is then compared with those of the sub-regions, and it is verified that the accuracies of the selected sub-regions are similar to that of the whole area. A reference sampling method based on this procedure is proposed. The results show that it can reduce the sampling area and sample size while keeping the same level of accuracy as existing methods.
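The fractal dimension of an image region is typically estimated by box counting: count the boxes of size s that contain any set pixel, then fit the slope of log N(s) against log(1/s). A sketch of that standard estimator, illustrating the kind of index used to match sub-regions to the whole scene; it is not the authors' code, and their dimension estimator may differ.

```python
import numpy as np

def box_counting_dimension(mask, sizes=(1, 2, 4, 8, 16)):
    """Box-counting (Minkowski) dimension estimate of a 2-D binary mask."""
    mask = np.asarray(mask, dtype=bool)
    counts = []
    for s in sizes:
        h, w = mask.shape[0] // s, mask.shape[1] // s
        # a box is "occupied" if any pixel inside it is set
        boxes = mask[: h * s, : w * s].reshape(h, s, w, s).any(axis=(1, 3))
        counts.append(max(boxes.sum(), 1))
    # slope of log N(s) vs. log(1/s) is the dimension estimate
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return slope
```

A completely filled region gives a dimension of 2, and more fragmented class patterns give lower values; sub-regions are then ranked by how close their dimension is to that of the whole area.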
