• Title/Summary/Keyword: data sampling


Comparison of Latin Hypercube Sampling and Simple Random Sampling Applied to Neural Network Modeling of HfO2 Thin Film Fabrication

  • Lee, Jung-Hwan; Ko, Young-Don; Yun, Il-Gu; Han, Kyong-Hee
    • Transactions on Electrical and Electronic Materials / v.7 no.4 / pp.210-214 / 2006
  • In this paper, two sampling methods, Latin hypercube sampling (LHS) and simple random sampling, were compared to improve the modeling speed of a neural network model. The sampling methods were used to generate the initial weight and bias sets. Electrical characteristic data for $HfO_2$ thin films were used as the modeling data. Ten initial parameter sets (initial weights and biases) were generated using LHS and simple random sampling, respectively. Modeling was performed with each set of generated initial parameters and the epoch number was measured, with all other network parameters held fixed. The 20 minimum epoch numbers obtained iteratively for LHS and for simple random sampling were analyzed by a nonparametric method because of their nonnormality.
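As a rough illustration of the comparison above, the sketch below draws candidate weight/bias initializations with LHS and with simple random sampling; the network size, value range, and SciPy usage are assumptions for illustration, not details from the paper.

```python
# A minimal sketch: generating neural-network initial weight/bias sets with
# Latin hypercube sampling (LHS) versus simple random sampling. The network
# size, weight range, and number of candidate sets are assumed values.
import numpy as np
from scipy.stats import qmc  # SciPy >= 1.7 provides the qmc module

n_params = 20   # assumed: total number of weights + biases in a small network
n_sets = 10     # the paper draws 10 initial parameter sets per method
low, high = -0.5, 0.5  # assumed initialization range

# Latin hypercube sampling: stratifies each parameter dimension
sampler = qmc.LatinHypercube(d=n_params, seed=0)
lhs_sets = qmc.scale(sampler.random(n=n_sets), low, high)

# Simple random sampling: i.i.d. uniform draws over the same range
rng = np.random.default_rng(0)
srs_sets = rng.uniform(low, high, size=(n_sets, n_params))

# Each row is one candidate initialization; training would then record the
# epoch count needed to reach a target error for each row.
print(lhs_sets.shape, srs_sets.shape)  # (10, 20) (10, 20)
```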

COMPARISON OF SUB-SAMPLING ALGORITHM FOR LRIT IMAGE GENERATION

  • Bae, Hee-Jin; Ahn, Sang-Il
    • Proceedings of the KSRS Conference / 2007.10a / pp.109-113 / 2007
  • The COMS provides LRIT/HRIT services to its users. The COMS LRIT/HRIT broadcast service must satisfy a 15-minute timeliness requirement, which is critical to the overall performance of the LHGS. HRIT image data are acquired from the INRSM output reception, whereas LRIT image data are generated in the LHGS by sub-sampling the HRIT image data. Because LRIT is obtained by sub-sampling, LRIT processing takes more time; in addition, some data loss occurs for LRIT because it is compressed with lossy JPEG. The sub-sampling algorithm should therefore combine fast processing with a simple implementation in order to satisfy the requirement. The algorithms investigated for the LHGS were the nearest-neighbour, bilinear, and bicubic algorithms. The nearest-neighbour algorithm was selected for the COMS LHGS in view of its speed, simplicity, and anti-aliasing behaviour, following the guideline of the user (KMA: Korea Meteorological Administration) to preserve as much cloud information as possible from a meteorological point of view. However, the nearest-neighbour algorithm is generally regarded as giving the poorest image quality, so this paper examines whether its selection for the LHGS is reasonable. First, the characteristics of the three sub-sampling algorithms are reviewed and compared. The algorithms are then applied to MTSAT-1R image data corresponding to COMS HRIT, resized images are reconstructed from the sub-sampled images with the same algorithms used for the HRIT-to-LRIT sub-sampling, and the differences between the original and resized images are compared. The PSNR and MSE are also calculated for each algorithm. The results show that selecting the nearest-neighbour algorithm for the COMS LHGS is appropriate, since in terms of PSNR the image it produces differs little from those of the other algorithms.
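The quality comparison described above can be sketched with OpenCV standing in for the LHGS implementation; the file name, sub-sampling factor, and filter set are illustrative assumptions.

```python
# A minimal sketch: down-sample an image with nearest-neighbour, bilinear,
# and bicubic interpolation, resize it back, and compare PSNR/MSE against
# the original. The input file name is a placeholder.
import cv2
import numpy as np

def psnr(original, resized):
    """Peak signal-to-noise ratio for 8-bit images."""
    mse = np.mean((original.astype(np.float64) - resized.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)

img = cv2.imread("hrit_sample.png", cv2.IMREAD_GRAYSCALE)  # placeholder input
h, w = img.shape
filters = {
    "nearest": cv2.INTER_NEAREST,
    "bilinear": cv2.INTER_LINEAR,
    "bicubic": cv2.INTER_CUBIC,
}
for name, flag in filters.items():
    sub = cv2.resize(img, (w // 4, h // 4), interpolation=flag)   # HRIT -> LRIT
    back = cv2.resize(sub, (w, h), interpolation=flag)            # resized image
    mse = np.mean((img.astype(np.float64) - back) ** 2)
    print(f"{name}: MSE={mse:.2f}, PSNR={psnr(img, back):.2f} dB")
```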

Support Vector Machine based on Stratified Sampling

  • Jun, Sung-Hae
    • International Journal of Fuzzy Logic and Intelligent Systems / v.9 no.2 / pp.141-146 / 2009
  • The support vector machine is a classification algorithm based on statistical learning theory that has performed well in many data mining applications. However, it has some drawbacks, one of which is its heavy computing cost, which makes it difficult to use in dynamic and online systems. To overcome this problem, we propose applying stratified sampling from statistical sampling theory. Stratified sampling reduces the size of the training data; in our paper, classification accuracy is maintained even though the sampled data set is small. We verify the improved performance with experimental results on data sets from the UCI machine learning repository.
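A minimal sketch of the idea, assuming scikit-learn and an arbitrary built-in data set in place of the paper's UCI sets; the subsample fraction is illustrative.

```python
# Draw a class-stratified subsample of the training data and fit an SVM on
# it instead of on the full set, then check accuracy on held-out data.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Stratified subsample: keep 20% of the training data with class
# proportions preserved (the fraction is an assumption, not from the paper)
X_small, _, y_small, _ = train_test_split(
    X_train, y_train, train_size=0.2, stratify=y_train, random_state=0)

svm = SVC(kernel="rbf", gamma="scale").fit(X_small, y_small)
print("accuracy on held-out data:", svm.score(X_test, y_test))
```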

A Comparison of Ensemble Methods Combining Resampling Techniques for Class Imbalanced Data (데이터 전처리와 앙상블 기법을 통한 불균형 데이터의 분류모형 비교 연구)

  • Lee, Hee-Jae; Lee, Sungim
    • The Korean Journal of Applied Statistics / v.27 no.3 / pp.357-371 / 2014
  • There are many studies on imbalanced data, in which the class distribution is highly skewed. To address the problem, previous work applies resampling techniques that correct the skewness of the class distribution in each sampled subset through under-sampling, over-sampling, or hybrid sampling such as SMOTE. Ensemble methods have also been used to alleviate the class imbalance problem. In this paper, we compare around a dozen algorithms that combine ensemble methods with resampling techniques, based on simulated data sets generated by the Backbone model, which allows the imbalance rate to be controlled. Results on various real imbalanced data sets are also presented to compare the effectiveness of the algorithms. We highly recommend combining resampling techniques with ensemble methods for imbalanced data in which the proportion of the minority class is less than 10%. We also find that each ensemble method has a well-matched sampling technique: algorithms that combine bagging or random forest ensembles with random under-sampling tend to perform well, whereas boosting ensembles perform better with over-sampling, and all ensemble methods combined with SMOTE perform well in most situations.
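One of the combinations compared above, bagging with random under-sampling, can be sketched with the imbalanced-learn package; the library choice and parameters are assumptions, not the paper's setup.

```python
# Bagging + random under-sampling: each base tree sees a balanced bootstrap.
# Requires imbalanced-learn >= 0.10 (which uses the `estimator` argument).
from collections import Counter
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Simulated imbalanced data with a ~5% minority class
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("class counts:", Counter(y))

clf = BalancedBaggingClassifier(
    estimator=DecisionTreeClassifier(), n_estimators=50, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("cross-validated AUC:", scores.mean().round(3))
```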

Feasibility Study on Sampling Ocean Meteorological Data using Stratified Method (층화추출법에 의한 해양기상환경의 표본추출 타당성 연구)

  • Han, Song-I; Cho, Yong-Jin
    • Journal of Ocean Engineering and Technology / v.28 no.3 / pp.254-259 / 2014
  • The infrared signature of a ship is largely influenced by the ocean environment of the operating area, which is known to cause large changes in the signature. The weather conditions therefore have to be clearly specified for an analysis of infrared signatures. In principle, meteorological data would have to be analyzed for every ocean in which the ship may operate, which is prohibitively costly and time-consuming because of the huge size of the data; a standard set of environmental variables for infrared signature research is therefore needed. In this study, we compared and analyzed sampling methods for representing the ocean data close to the Korean peninsula. We collected ocean meteorological records from the KMA (Korea Meteorological Administration) and sampled them in several ways, considering five variables known to affect the infrared signature. Specifically, simple random sampling over all the data and 1-D, 2-D, and 3-D stratified sampling methods were compared and analyzed in terms of the mean square error of each method.
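A minimal sketch of the contrast between simple random and 1-D stratified sampling on tabular meteorological records, assuming pandas; the variables, strata, and sample size are illustrative, not the study's.

```python
# Compare how well a simple random sample and a 1-D stratified sample
# reproduce a population mean. Requires pandas >= 1.1 for GroupBy.sample.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "air_temp": rng.normal(15, 8, n),     # degrees C (synthetic)
    "humidity": rng.uniform(30, 100, n),  # percent (synthetic)
})

# Simple random sample of 500 records
srs = df.sample(n=500, random_state=0)

# 1-D stratified sample: stratify on binned air temperature and draw
# proportionally from each stratum
df["stratum"] = pd.qcut(df["air_temp"], q=10, labels=False)
strat = df.groupby("stratum").sample(frac=500 / n, random_state=0)

for name, s in [("SRS", srs), ("stratified", strat)]:
    err = (s["air_temp"].mean() - df["air_temp"].mean()) ** 2
    print(f"{name}: squared error of the mean = {err:.4f}")
```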

Anomaly Detection In Real Power Plant Vibration Data by MSCRED Base Model Improved By Subset Sampling Validation (Subset 샘플링 검증 기법을 활용한 MSCRED 모델 기반 발전소 진동 데이터의 이상 진단)

  • Hong, Su-Woong; Kwon, Jang-Woo
    • Journal of Convergence for Information Technology / v.12 no.1 / pp.31-38 / 2022
  • This paper applies MSCRED (Multi-Scale Convolutional Recurrent Encoder-Decoder), an expert-independent, unsupervised neural-network model for multivariate time-series analysis, to real power plant vibration data. Because MSCRED is based on an autoencoder, its training data must not be contaminated with anomalies; to overcome this limitation, a training-data sampling technique called subset sampling validation is used. Using labeled vibration data from power plant equipment, the classification performance of MSCRED is evaluated with the anomaly score in two cases: 1) when abnormal data are mixed into the training data, and 2) when the abnormal data of case 1 are removed from the training data. The paper thereby presents an expert-independent anomaly diagnosis framework that is robust to erroneous data and provides a concise and accurate solution for multivariate time-series data in various fields.
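The abstract does not spell out the subset sampling validation procedure; one plausible reading is sketched below, with a PCA reconstruction standing in for the MSCRED autoencoder. Every detail here is an assumption for illustration, not the paper's method.

```python
# A hedged sketch: partition possibly contaminated training data into
# subsets, score each subset by reconstruction error, and keep only the
# cleanest subsets for final training.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(1000, 16))        # normal vibration features
X[::50] += rng.normal(8, 1, size=(20, 16))   # injected contamination

def reconstruction_error(model, data):
    recon = model.inverse_transform(model.transform(data))
    return np.mean((data - recon) ** 2, axis=1)

# Split the training pool into subsets and score each one with a rough
# model fitted on everything
subsets = np.array_split(rng.permutation(len(X)), 10)
ref = PCA(n_components=4).fit(X)
subset_scores = [reconstruction_error(ref, X[idx]).mean() for idx in subsets]

# Keep the subsets with the lowest mean error as "clean" training data
keep = np.argsort(subset_scores)[:7]         # assumed: keep 7 of 10 subsets
X_clean = np.vstack([X[subsets[i]] for i in keep])
final_model = PCA(n_components=4).fit(X_clean)
print("kept", len(X_clean), "of", len(X), "training rows")
```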

On sampling algorithms for imbalanced binary data: performance comparison and some caveats (불균형적인 이항 자료 분석을 위한 샘플링 알고리즘들: 성능비교 및 주의점)

  • Kim, HanYong; Lee, Woojoo
    • The Korean Journal of Applied Statistics / v.30 no.5 / pp.681-690 / 2017
  • Imbalanced binary classification problems arise in many applications, such as fraud detection in banking operations, spam mail detection, and defective product prediction. To overcome the poor prediction performance of binary classifiers when the proportion of one group is dominant, several sampling methods such as over-sampling, under-sampling, and SMOTE have been developed. In this study, we investigate the prediction performance of logistic regression, lasso, random forest, boosting, and support vector machine in combination with these sampling methods for binary imbalanced data. Four real data sets are analyzed to see whether the sampling methods yield a substantial improvement in prediction performance. We also emphasize some precautions to take when the sampling methods are implemented.
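One pairing examined above, SMOTE with logistic regression, is sketched below together with the usual caveat such studies emphasize: resampling must be applied only to training folds, never to evaluation data. The imbalanced-learn pipeline is an assumed tool choice, not the paper's.

```python
# An imblearn Pipeline applies SMOTE inside each training fold of
# cross-validation; the test fold stays untouched.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),        # runs only on training folds
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print("cross-validated AUC:", scores.mean().round(3))
```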

Sampling Bias of Discontinuity Orientation Measurements for Rock Slope Design in Linear Sampling Technique : A Case Study of Rock Slopes in Western North Carolina (선형 측정 기법에 의해 발생하는 불연속면 방향성의 왜곡 : 서부 North Carolina의 암반 사면에서의 예)

  • 박혁진
    • Journal of the Korean Geotechnical Society / v.16 no.1 / pp.145-155 / 2000
  • Orientation data for discontinuities are of paramount importance in rock slope stability studies because they control the possibility of unstable conditions or excessive deformation. Most orientation data are collected with linear sampling techniques, such as borehole fracture mapping and the detailed scanline method (outcrop mapping). However, data acquired with these techniques are subject to bias arising from the orientation of the sampling line. Even when a weighting factor is applied to reduce this bias, the bias is not significantly reduced for certain sampling orientations: if the sampling line is nearly parallel to the discontinuity orientation, most of the discontinuities parallel to it are excluded from the survey results. This can cause serious misinterpretation of the discontinuity orientations because critical information is omitted. In the case study, orientation data collected with the borehole fracture mapping method (vertical scanline) were compared with orientation data from the detailed scanline method (horizontal scanline). The differences between the two procedures raised the concern that a representative orientation of the discontinuities had not been obtained. Equal-area polar stereonets were used to determine the distribution of dip angles and to compare the data distributions for the borehole method versus the scanline method.
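The weighting factor mentioned above is commonly the Terzaghi correction, which weights each observed discontinuity by the reciprocal of the cosine of the angle between the sampling line and the discontinuity pole; a minimal sketch under assumed orientations follows, illustrating why a vertical borehole under-samples sub-vertical sets.

```python
# Terzaghi weighting sketch: weight = 1/|cos(delta)|, where delta is the
# angle between the scanline and the pole (normal) of the discontinuity.
# The weight cap and the example orientations are illustrative assumptions.
import numpy as np

def unit_vector(trend_deg, plunge_deg):
    """Direction cosines of a line from trend/plunge in degrees."""
    t, p = np.radians(trend_deg), np.radians(plunge_deg)
    return np.array([np.cos(p) * np.sin(t), np.cos(p) * np.cos(t), -np.sin(p)])

def terzaghi_weight(scanline, pole, max_weight=10.0):
    """Bias-correction weight for one discontinuity on one scanline."""
    cos_delta = abs(np.dot(scanline, pole))
    # Cap the weight: near-parallel geometry (the 'blind zone') would
    # otherwise produce unbounded weights
    return min(1.0 / cos_delta, max_weight) if cos_delta > 0 else max_weight

scanline = unit_vector(0, 90)        # vertical borehole
pole_flat = unit_vector(120, 85)     # near-horizontal plane, steep pole
pole_steep = unit_vector(210, 5)     # near-vertical plane, shallow pole
for name, pole in [("sub-horizontal set", pole_flat),
                   ("sub-vertical set", pole_steep)]:
    print(name, "weight =", round(terzaghi_weight(scanline, pole), 2))
```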

Modified n-Level Skip-Lot Sampling Inspection Plans

  • Cho, Gyo-Young
    • Journal of the Korean Data and Information Science Society / v.19 no.3 / pp.811-818 / 2008
  • This paper generalizes the modified two-level skip-lot sampling plan (MTSkSP2) to n levels. General formulas for the operating characteristic (OC) function, average sample number (ASN), and average outgoing quality (AOQ) of the plan are derived using Markov chain properties.
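As background for the formulas mentioned above, the sketch below computes the OC function of a reference single-sampling plan, the building block on which skip-lot plans operate; the sample size n, acceptance number c, and defect fractions are illustrative, not the paper's values.

```python
# OC function of a reference single-sampling plan: the probability of
# accepting a lot as a function of its defect fraction p.
from scipy.stats import binom

def oc_single_sampling(p, n=50, c=2):
    """P(accept lot) = P(at most c defectives in a sample of n)."""
    return binom.cdf(c, n, p)

for p in [0.01, 0.02, 0.05, 0.10]:
    print(f"p = {p:.2f}: P(accept) = {oc_single_sampling(p):.3f}")
```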

A study on the Spatial Sampling Method to Minimize Spatial Autocorrelation of Spatial and Geographical Data (공간·지리적 자료의 공간자기상관성을 최소화하는 공간샘플링 기법에 관한 연구)

  • Lee, Youn Soo; Lee, Man Choul; Lah, Kyung Beom; Kang, Jun Mo
    • KSCE Journal of Civil and Environmental Engineering Research / v.34 no.4 / pp.1317-1325 / 2014
  • This study analyzed spatial sampling methods that minimize the spatial autocorrelation of spatial and geographical data, and reached two conclusions. First, a suitable spatial sampling method is essential for removing spatial autocorrelation from spatial or geographical data. The modal share of public transportation in Seoul showed a high degree of spatial autocorrelation, but the study found that the autocorrelation was eliminated when samples were extracted at a sufficient distance (above 400 m) apart. Without spatial sampling, the distortion in the spatial data leads to false results, so spatial sampling is indispensable. Second, the factors that influence the modal share of public transportation changed before and after spatial sampling, because the inherent spatial autocorrelation of the data could not otherwise be controlled.
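A minimal sketch of distance-constrained spatial sampling as described above: greedily retain points at least 400 m apart (the threshold reported in the study); the coordinates are synthetic.

```python
# Greedy distance thinning: keep a point only if it is at least min_dist
# from every point already kept.
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0, 5000, size=(2000, 2))  # synthetic x/y positions in metres

def spatial_sample(points, min_dist=400.0):
    """Retain points pairwise separated by >= min_dist."""
    kept = []
    for p in points:
        if all(np.linalg.norm(p - q) >= min_dist for q in kept):
            kept.append(p)
    return np.array(kept)

sample = spatial_sample(points)
print(len(sample), "of", len(points), "points retained")
```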