• Title/Abstract/Keywords: under-sampling

Search results: 1,082 items (processing time 0.034 s)

Heterogeneous Ensemble of Classifiers from Under-Sampled and Over-Sampled Data for Imbalanced Data

  • Kang, Dae-Ki;Han, Min-gyu
    • International journal of advanced smart convergence
    • /
    • Vol. 8, No. 1
    • /
    • pp.75-81
    • /
    • 2019
  • Data imbalance is a common problem that causes serious difficulties in the machine learning process. Sampling is one of the effective methods for addressing it. Over-sampling increases the number of instances, so on imbalanced data it is applied to the minority class. Under-sampling reduces the number of instances and is usually applied to the majority class. We apply under-sampling and over-sampling to imbalanced data to generate sampled data sets. From these sampled data sets and the original data set, we construct a heterogeneous ensemble of classifiers, applying five different algorithms to the ensemble. Experimental results on an intrusion detection data set, used as an example of imbalanced data, show that our approach is effective.
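
The two sampling steps described in the abstract above can be sketched as follows. This is a minimal illustration using purely random selection and a made-up toy data set; the function names, the labels "normal" and "attack", and the target sizes are assumptions for illustration, not the authors' procedure.

```python
import random
from collections import Counter


def random_under_sample(X, y, majority_label, n_keep, seed=0):
    """Keep every non-majority row plus a random subset of n_keep majority rows."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    other_idx = [i for i, label in enumerate(y) if label != majority_label]
    kept = sorted(rng.sample(majority_idx, n_keep) + other_idx)
    return [X[i] for i in kept], [y[i] for i in kept]


def random_over_sample(X, y, minority_label, n_target, seed=0):
    """Duplicate random minority rows (with replacement) up to n_target rows."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    extra = [rng.choice(minority_idx) for _ in range(n_target - len(minority_idx))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]


# Toy data (hypothetical): 8 "normal" rows and 2 "attack" rows.
X = [[i] for i in range(10)]
y = ["normal"] * 8 + ["attack"] * 2

X_under, y_under = random_under_sample(X, y, "normal", n_keep=2)
X_over, y_over = random_over_sample(X, y, "attack", n_target=8)
under_counts = Counter(y_under)
over_counts = Counter(y_over)
```

In an ensemble like the one the abstract describes, the original, under-sampled, and over-sampled sets would each be used to train classifiers; that step is omitted here.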

Bayesian Multiattribute Acceptance Sampling Plans under Curtailed Inspection

  • 이종성
    • 산업기술연구
    • /
    • Vol. 9
    • /
    • pp.51-56
    • /
    • 1989
  • A methodology for determining optimal sampling plans for Bayesian multiattribute curtailed inspection models is proposed, whereby sampling inspection is terminated as soon as the disposition of the inspection lot is determined. An iterative solution procedure is developed for obtaining optimal multiattribute acceptance sampling plans under curtailed sampling inspection.
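
The idea of curtailment mentioned above, stopping as soon as the lot's disposition is settled, can be illustrated with a deliberately simplified sketch. It assumes an ordinary single-attribute plan with sample size `n` and acceptance number `c`; the Bayesian, multiattribute optimization in the paper is not reproduced, and the function name is hypothetical.

```python
def curtailed_inspection(items, n, c):
    """Apply a single-sampling plan (n, c) with curtailment.

    items: iterable of booleans, True meaning the inspected item is defective.
    Returns (decision, number_of_items_inspected).
    """
    defects = 0
    good = 0
    for inspected, is_defective in enumerate(items, start=1):
        if is_defective:
            defects += 1
        else:
            good += 1
        # Rejection is certain once the defect count exceeds c.
        if defects > c:
            return "reject", inspected
        # Acceptance is certain once the items still uninspected
        # could no longer push the defect count above c.
        if good >= n - c:
            return "accept", inspected
    raise ValueError("fewer than n items were supplied")
```

For example, with `n=10` and `c=1`, a lot whose first and third items are defective is rejected after three inspections, while a lot with no defectives is accepted after nine.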

On the Estimation of Fraction Defectives

  • Kim, Seong-in
    • 품질경영학회지
    • /
    • Vol. 8, No. 2
    • /
    • pp.3-14
    • /
    • 1980
  • This paper is concerned with the design of an appropriate sampling plan or stopping rule and the construction of an estimate of the process or lot fraction defective. Various sampling plans that are well known or have potential applications are unified into a generalized sampling plan. Under this sampling plan, the sufficient statistic, probability distribution, moments, and minimum variance unbiased estimate are obtained. Results for various sampling plans can be derived as special cases. Then, under given parameter values, the relative efficiencies of the various sampling plans are compared with respect to expected sample sizes and variances of estimates.

A Hybrid Under-sampling Approach for Better Bankruptcy Prediction

  • 김태훈;안현철
    • 지능정보연구
    • /
    • Vol. 21, No. 2
    • /
    • pp.173-190
    • /
    • 2015
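  • Bankruptcy can cause enormous social and economic losses, so accurately predicting it in advance and responding preemptively is one of the most important decision problems in management. The field of intelligent information systems has accordingly worked to improve bankruptcy prediction based on corporate financial data. Unfortunately, most existing studies have focused mainly on raising prediction accuracy by improving the classification model and have not sufficiently considered other factors. Against this background, this study proposes a new data preprocessing method, specifically an effective sampling method, as a way to improve the accuracy of bankruptcy prediction models. Data used for bankruptcy prediction generally suffer from a severe data imbalance problem. This study addresses the problem with a hybrid under-sampling approach that combines the k-reverse nearest neighbor (k-RNN) and one-class support vector machine (OCSVM) methods. In the proposed approach, k-RNN effectively removes outliers, and OCSVM serves to select only information-rich samples from the majority class. To validate the proposed technique, we applied it to building a bankruptcy prediction model for non-externally-audited firms at a Korean bank and compared its performance with that of commonly used random sampling. The results show that classification accuracy improved for most classification models, including logistic regression, discriminant analysis, decision trees, and SVM, and that for all classification models the false negative error, that is, the rate at which insolvent firms are predicted to be healthy, decreased substantially.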

The systematic sampling for inferring the survey indices of Korean groundfish stocks

  • Hyun, Saang-Yoon;Seo, Young IL
    • Fisheries and Aquatic Sciences
    • /
    • Vol. 21, No. 8
    • /
    • pp.24.1-24.9
    • /
    • 2018
  • The Korean bottom trawl survey has been deployed on a regular basis for about the last decade as part of groundfish stock assessments. The regularity means that groundfish are sampled once per grid cell whose sides are half a degree of latitude and half a degree of longitude, and whose interior is further divided into nine nested grids. Unless there is a special reason (e.g., running into a rocky bottom), the sample location is the center grid of the nine nested grids. Given data collected by the survey, we intended to show how to appropriately estimate not only the survey index of a fish stock but also its uncertainty. Because of this regularity, we applied systematic sampling theory for these purposes and compared its results with a reference based on simple random sampling. Using survey data on 11 fish stocks, collected by the spring and fall surveys in 2014, the survey indices estimated under systematic sampling were overall more precise than those under simple random sampling. For the survey indices in number, the standard errors under systematic sampling were 0.23% to 27.44% lower than those under simple random sampling, while for the survey indices in weight they were 0.04% to 31.97% lower. In terms of bias, systematic sampling was the same as simple random sampling. Our paper is the first to formally show how to apply systematic sampling theory to the actual data collected by the Korean bottom trawl surveys.

Economic-Statistical Design of Double Sampling $T^2$ Control Chart under Weibull Failure Model

  • 홍성옥;이민구;이주호
    • 품질경영학회지
    • /
    • Vol. 43, No. 4
    • /
    • pp.471-488
    • /
    • 2015
  • Purpose: The double sampling $T^2$ chart is a useful tool for detecting a relatively small shift in the process mean when the process is controlled by multiple variables. This paper finds the optimal design of the double sampling $T^2$ chart in both an economic and a statistical sense under the Weibull failure model. Methods: The expected cost function is derived mathematically using a recursive equation approach. The optimal designs are found using a genetic algorithm for numerical examples and compared with those of the single sampling $T^2$ chart. Sensitivity analysis is performed to examine the effects of the parameters. Results: The proposed design outperforms the optimal design of the single sampling $T^2$ chart in terms of the expected cost per unit time and the Type-I error rate for all the numerical examples considered. Conclusion: The double sampling $T^2$ chart can be designed to satisfy both economic and statistical requirements under the Weibull failure model, and the resulting design is better than its single sampling counterpart.

New Attributes and Variables Control Charts under Repetitive Sampling

  • Aslam, Muhammad;Azam, Muhammad;Jun, Chi-Hyuck
    • Industrial Engineering and Management Systems
    • /
    • Vol. 13, No. 1
    • /
    • pp.101-106
    • /
    • 2014
  • New control charts under repetitive sampling are proposed, which can be used for variables and attributes quality characteristics. The proposed control charts have inner and outer control limits so that repetitive sampling may be needed if the plotted statistic falls between the two limits. Particularly, the new np and variable X-bar control charts under repetitive sampling are considered in detail. The in-control and out-of-control average run lengths are analyzed according to various process shifts. The performance of the proposed control charts is compared with the existing np and the X-bar control charts in terms of the average run lengths.
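
The decision rule with inner and outer control limits described above can be sketched as below. This is an illustrative reading only: the symmetric limits around a center line, the treatment of boundary values, the `max_repeats` cap, and all names are assumptions, not the authors' chart design.

```python
def repetitive_sampling_decision(statistic, center, inner_width, outer_width):
    """Classify one plotted statistic against inner and outer control limits."""
    if not 0 <= inner_width <= outer_width:
        raise ValueError("need 0 <= inner_width <= outer_width")
    deviation = abs(statistic - center)
    if deviation > outer_width:
        return "out-of-control"   # beyond the outer limits
    if deviation <= inner_width:
        return "in-control"       # within the inner limits
    return "resample"             # between the two limits: sample again


def monitor(draw_statistic, center, inner_width, outer_width, max_repeats=100):
    """Repeat sampling until a decision is reached or max_repeats is hit.

    draw_statistic: zero-argument callable returning a fresh sample statistic.
    Returns (decision, number_of_samples_taken).
    """
    for repeats in range(1, max_repeats + 1):
        decision = repetitive_sampling_decision(
            draw_statistic(), center, inner_width, outer_width
        )
        if decision != "resample":
            return decision, repeats
    return "resample", max_repeats
```

With a center of 10.0, an inner width of 0.5, and an outer width of 1.0, successive statistics 10.8, 10.7, and 10.2 trigger two repeats before the third sample declares the process in control.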

On inference of multivariate means under ranked set sampling

  • Rochani, Haresh;Linder, Daniel F.;Samawi, Hani;Panchal, Viral
    • Communications for Statistical Applications and Methods
    • /
    • Vol. 25, No. 1
    • /
    • pp.1-13
    • /
    • 2018
  • In many studies, a researcher attempts to describe a population where units are measured for multiple outcomes, or responses. In this paper, we present an efficient procedure based on ranked set sampling to estimate and perform hypothesis testing on a multivariate mean. The method is based on ranking on an auxiliary covariate, which is assumed to be correlated with the multivariate response, in order to improve the efficiency of the estimation. We show that the proposed estimators developed under this sampling scheme are unbiased, have smaller variance in the multivariate sense, and are asymptotically Gaussian. We also demonstrate that the efficiency of the multivariate regression estimator can be improved by using ranked set sampling. A bootstrap routine is developed in the statistical software R to perform inference when the sample size is small. We use a simulation study to investigate the performance of the method under known conditions and apply the method to the biomarker data collected in the China Health and Nutrition Survey (CHNS 2009).
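
The selection step of ranked set sampling with ranking on an auxiliary covariate, as outlined above, can be sketched as follows. The sketch assumes balanced ranked set sampling in which each set is drawn without replacement from a finite list and different sets may share units; the function name and the tuple layout are hypothetical, and the paper's estimators and bootstrap routine are not reproduced.

```python
import random


def ranked_set_sample(population, set_size, cycles, key, seed=0):
    """Select units by balanced ranked set sampling.

    population: list of units, e.g. (covariate, response) tuples.
    key: function returning the auxiliary covariate used for ranking.
    In each cycle, set_size random sets of set_size units are drawn;
    the unit with rank r (by covariate) is kept from the r-th set.
    """
    rng = random.Random(seed)
    selected = []
    for _ in range(cycles):
        for rank in range(set_size):
            candidates = rng.sample(population, set_size)
            candidates.sort(key=key)
            selected.append(candidates[rank])
    return selected
```

Only the selected units would then have their (possibly multivariate) responses measured, which is where the efficiency gain over simple random sampling is expected to come from.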

Sampling Plans Based on Truncated Life Test for a Generalized Inverted Exponential Distribution

  • Singh, Sukhdev;Tripathi, Yogesh Mani;Jun, Chi-Hyuck
    • Industrial Engineering and Management Systems
    • /
    • Vol. 14, No. 2
    • /
    • pp.183-195
    • /
    • 2015
  • In this paper, we propose a two-stage group acceptance sampling plan for the generalized inverted exponential distribution under a truncated life test. Median life is considered as the quality parameter. Design parameters are obtained to ensure that the true median life is longer than a given specified life at a certain level of consumer's risk and producer's risk. We also explore situations under which design parameters based on median lifetime can be used for other percentile points. Tables and specific examples are reported to explain the proposed plans. Finally, a real data set is analyzed to implement the plans in practical situations, and some suggestions are given.

A Study on Determining Job Sequence by Sampling Method (II)

  • 강성수;노인규
    • 산업경영시스템학회지
    • /
    • Vol. 12, No. 19
    • /
    • pp.25-30
    • /
    • 1989
  • This study is concerned with a job sequencing method based on the concept of sampling. The sampling technique has not previously been applied to the development of scheduling algorithms. Most job sequencing algorithms have been developed to find the best or a good solution under special conditions. As a result, developing appropriate job schedules that satisfy complex work conditions is both very difficult and very time-consuming, and the application areas of these algorithms are very narrow. Under these circumstances, it is desirable to develop a simple job sequencing method that can produce a good solution in a short time under any complex work conditions. This study calls such a method a sampling job sequencing method. The study examines the selection, by sampling, of good job sequences in the top 1%-5% group. The results show that, in experiments on the 2/n/F/F max problem, there is a set of job sequences, 0.5%-5% of the total, that has the same performance measure as the optimal job sequence. This indicates that the sampling job sequencing method is useful for finding an optimal or good job sequence with little effort and time. The ANOVA results show that two factors, the number of jobs and the range of processing times, are significant for determining the job sequence at $\alpha$=0.01. The study is further extended to job shop problems with three or more machines.
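
The sampling idea in the abstract above can be sketched for a two-machine flow shop, reading "2/n/F/F max" as n jobs on 2 machines in flow-shop order with maximum flow time as the criterion (this reading is an assumption). With all jobs ready at time zero, the maximum flow time equals the makespan. Function names, the sample count, and the processing times are illustrative, not taken from the paper.

```python
import random


def makespan(sequence, times):
    """Completion time of the last job in a two-machine flow shop.

    times[j] = (time on machine 1, time on machine 2) for job j.
    """
    end1 = 0  # time at which machine 1 finishes its current job
    end2 = 0  # time at which machine 2 finishes its current job
    for job in sequence:
        t1, t2 = times[job]
        end1 += t1
        end2 = max(end1, end2) + t2  # machine 2 waits for machine 1 if needed
    return end2


def best_sampled_sequence(times, n_samples, seed=0):
    """Draw n_samples random job sequences and keep the best one found."""
    rng = random.Random(seed)
    jobs = list(range(len(times)))
    best_seq, best_val = None, None
    for _ in range(n_samples):
        seq = jobs[:]
        rng.shuffle(seq)
        val = makespan(seq, times)
        if best_val is None or val < best_val:
            best_seq, best_val = seq, val
    return best_seq, best_val


# Hypothetical processing times for three jobs.
times = [(3, 2), (1, 4), (2, 1)]
best_seq, best_val = best_sampled_sequence(times, n_samples=200)
```

For these three jobs the sequence `[1, 0, 2]` gives a makespan of 8, which matches the lower bound of the shortest machine-1 time plus the total machine-2 time, so no sampled sequence can do better.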
