• Title/Abstract/Keywords: Validation data set

379 search results (processing time: 0.027 seconds)

Comparison of different post-processing techniques in real-time forecast skill improvement

  • Jabbari, Aida;Bae, Deg-Hyo
    • Proceedings of the Korea Water Resources Association Conference / Korea Water Resources Association 2018 Conference / p.150 / 2018
  • Numerical Weather Prediction (NWP) models provide the information used in weather forecasts. Meteorological models simplify the highly nonlinear and complex interactions in the atmosphere through approximations and parameterization, and these simplifications may lead to biases and errors in model results. Although the models have improved over time, their biased outputs remain a concern in meteorological and hydrological studies, so bias removal is an essential step before using the outputs of atmospheric models. The main idea of statistical bias correction is to develop a statistical relationship between modeled and observed variables over the same historical period. Model Output Statistics (MOS) are used to better match real-time forecast data with observation records; such statistical post-processing methods relate model outputs to the observed values at the sites of interest. In this study, three methods are used to remove possible biases from the real-time outputs of the Weather Research and Forecasting (WRF) model in the Imjin basin (North and South Korea): Linear Regression (LR), Linear Scaling (LS), and Power Scaling (PS). The MOS techniques used in this study comprise three main steps: preprocessing of the historical data in the training set, development of the equations, and application of the equations to the validation set. The expected results show the improvement in accuracy of the real-time forecast data after bias correction. Comparing the methods will identify the best one for enhancing forecast skill in a real-time case study.
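
The three MOS steps described in this abstract (fit on a training set, derive the equations, apply them to a validation set) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data values are hypothetical, linear scaling is assumed to be the multiplicative form (ratio of observed to forecast means), and power scaling is omitted.

```python
from statistics import mean

def fit_linear_scaling(forecast_train, observed_train):
    """Multiplicative linear scaling: ratio of observed mean to forecast mean."""
    return mean(observed_train) / mean(forecast_train)

def fit_linear_regression(forecast_train, observed_train):
    """Ordinary least squares fit of observed = a * forecast + b."""
    fx, fy = mean(forecast_train), mean(observed_train)
    sxx = sum((x - fx) ** 2 for x in forecast_train)
    sxy = sum((x - fx) * (y - fy) for x, y in zip(forecast_train, observed_train))
    a = sxy / sxx
    return a, fy - a * fx

def rmse(predicted, observed):
    """Root-mean-square error between two equally long sequences."""
    return (sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)) ** 0.5

# Hypothetical training and validation data (e.g. precipitation in mm).
train_fc, train_obs = [2.0, 4.0, 6.0, 8.0], [3.0, 6.0, 9.0, 12.0]
valid_fc, valid_obs = [1.0, 5.0], [1.5, 7.5]

# Steps 1-2: develop the correction equations on the training set.
factor = fit_linear_scaling(train_fc, train_obs)
a, b = fit_linear_regression(train_fc, train_obs)

# Step 3: apply them to the validation set and compare skill.
ls_corrected = [factor * x for x in valid_fc]
lr_corrected = [a * x + b for x in valid_fc]
raw_error = rmse(valid_fc, valid_obs)
ls_error = rmse(ls_corrected, valid_obs)
```

Comparing `raw_error` with `ls_error` (or the LR equivalent) on the validation set is one simple way to quantify the improvement after bias correction.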


Predictive Spatial Data Fusion Using Fuzzy Object Representation and Integration: Application to Landslide Hazard Assessment

  • Park, No-Wook;Chi, Kwang-Hoon;Chung, Chang-Jo;Kwon, Byung-Doo
    • Korean Journal of Remote Sensing / Vol. 19, No. 3 / pp.233-246 / 2003
  • This paper presents a methodology that accounts for partial or gradual changes of environmental phenomena in categorical map information when fusing or integrating multiple spatial data. A fuzzy set based spatial data fusion scheme is applied to account for the fuzziness of boundaries in categorical information that shows partial or gradual environmental impacts. The fuzziness or uncertainty of a boundary is represented by two kinds of fuzzy membership functions based on the fuzzy object concept, and their effects are quantitatively evaluated with a cross-validation procedure. A case study of landslide hazard assessment demonstrates that this scheme performs better than the traditional crisp boundary representation.

Prediction Model for Unpaid Customers Using Big Data (빅 데이터 기반의 체납 수용가 예측 모델)

  • Jeong, Jaean;Lee, Kyouhwan;Jung, Hoekyung
    • Journal of the Korea Institute of Information and Communication Engineering / Vol. 24, No. 7 / pp.827-833 / 2020
  • In this paper, to reduce the unpaid rate of local governments, the internal data elements in Water-INFOS that affect arrears were identified through interviews with meter readers in certain local governments, and candidate data affecting arrears were derived from national statistical data. The influence of each independent variable on the dependent variable was assessed with information gain, which examines the disorder of the dependent variable in the data set. We also compared the prediction rates of a decision tree and logistic regression using n-fold cross-validation. The results confirmed that the decision tree can find more accurate customer payment patterns than logistic regression. In developing the analysis algorithm model with machine learning, the optimal values of two environmental variables that directly affect the complexity and accuracy of the decision tree, the minimum number of data and the maximum purity, were derived to improve the accuracy of the algorithm.
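
Information gain, as referred to in this abstract, can be illustrated with a short sketch. It uses the standard Shannon-entropy definition; the feature names and records below are hypothetical and are not taken from Water-INFOS.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical records: payment status and two candidate features.
status = ["unpaid", "unpaid", "paid", "paid"]
housing = ["rented", "rented", "owned", "owned"]   # separates the classes perfectly
district = ["north", "south", "north", "south"]    # carries no information

gain_housing = information_gain(housing, status)
gain_district = information_gain(district, status)
```

A feature with higher gain reduces the disorder of the dependent variable more, which is why it would be preferred when selecting variables or choosing a split in a decision tree.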

Stability evaluation model for loess deposits based on PCA-PNN

  • Li, Guangkun;Su, Maoxin;Xue, Yiguo;Song, Qian;Qiu, Daohong;Fu, Kang;Wang, Peng
    • Geomechanics and Engineering / Vol. 27, No. 6 / pp.551-560 / 2021
  • Because of their low strength and high compressibility, tunnels in loess deposits are prone to large deformations and collapse. An accurate stability evaluation for loess deposits is of considerable significance for deformation control and safety work during tunnel construction. Thirty-seven groups of representative data from real loess deposit cases were adopted to establish the stability evaluation model for the tunnel project in Yan'an, China. Five physical and mechanical indices (water content, cohesion, internal friction angle, elastic modulus, and Poisson's ratio) were selected as the index system for the stability level of loess. The data set was randomly divided into a training set (80%) and a test set (20%). First, principal component analysis (PCA) was used to convert the five indices into three linearly independent principal components X1, X2 and X3. Then, the principal components were used as input vectors for a probabilistic neural network (PNN) to map the nonlinear relationship between the index system and the stability level of loess. Furthermore, leave-one-out cross-validation was applied to the training set to find a suitable smoothing factor. Finally, the established model with the target smoothing factor of 0.04 was applied to the test set, and a prediction accuracy of 100% was obtained. This intelligent classification method for loess deposits is easy to apply and has wide potential applications in evaluating loess deposits.
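
The leave-one-out search for a smoothing factor can be sketched as below. This is an illustration under assumptions, not the paper's model: the PNN is written as a plain Gaussian Parzen-window classifier, the PCA step is skipped (the inputs stand in for principal components), and the data points, class names, and candidate smoothing factors are hypothetical.

```python
from math import exp

def pnn_predict(x, train_x, train_y, sigma):
    """Return the class whose mean Gaussian kernel response to x is largest."""
    sums, counts = {}, {}
    for xi, yi in zip(train_x, train_y):
        d2 = sum((a - b) ** 2 for a, b in zip(x, xi))
        sums[yi] = sums.get(yi, 0.0) + exp(-d2 / (2 * sigma ** 2))
        counts[yi] = counts.get(yi, 0) + 1
    return max(sums, key=lambda c: sums[c] / counts[c])

def loo_accuracy(xs, ys, sigma):
    """Leave-one-out accuracy: classify each sample using all the others."""
    hits = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        hits += pnn_predict(xs[i], rest_x, rest_y, sigma) == ys[i]
    return hits / len(xs)

# Hypothetical two-component training samples and stability labels.
xs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
ys = ["stable", "stable", "stable", "unstable", "unstable", "unstable"]

# Pick the candidate smoothing factor with the best leave-one-out accuracy.
candidates = [0.05, 0.2, 1.0]
best_sigma = max(candidates, key=lambda s: loo_accuracy(xs, ys, s))
prediction = pnn_predict((0.05, 0.05), xs, ys, best_sigma)
```

Leave-one-out is a natural choice for a data set of only 37 samples, since each fold keeps almost all of the training data available.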

Risk-Scoring System for Prediction of Non-Curative Endoscopic Submucosal Dissection Requiring Additional Gastrectomy in Patients with Early Gastric Cancer

  • Kim, Tae-Se;Min, Byung-Hoon;Kim, Kyoung-Mee;Yoo, Heejin;Kim, Kyunga;Min, Yang Won;Lee, Hyuk;Rhee, Poong-Lyul;Kim, Jae J.;Lee, Jun Haeng
    • Journal of Gastric Cancer / Vol. 21, No. 4 / pp.368-378 / 2021
  • Purpose: When patients with early gastric cancer (EGC) undergo non-curative endoscopic submucosal dissection requiring gastrectomy (NC-ESD-RG), additional medical resources and expenses are required for surgery. To reduce this burden, a predictive model for NC-ESD-RG is required. Materials and Methods: Data from 2,997 patients undergoing ESD for 3,127 forceps biopsy-proven differentiated-type EGCs (2,345 and 782 in the training and validation sets, respectively) were reviewed. Using the training set, stepwise logistic regression analysis determined the independent predictors of NC-ESD-RG (NC-ESD other than cases with lateral resection margin involvement or piecemeal resection as the only non-curative factor). Using these predictors, a risk-scoring system for predicting NC-ESD-RG was developed. The performance of the predictive model was examined internally with the validation set. Results: The rate of NC-ESD-RG was 17.3%. Independent pre-ESD predictors of NC-ESD-RG included moderately differentiated or papillary EGC, large tumor size, proximal tumor location, lesion at the greater curvature, elevated or depressed morphology, and presence of ulcers. A risk score was assigned to each predictor of NC-ESD-RG. The area under the receiver operating characteristic curve for predicting NC-ESD-RG was 0.672 in both the training and validation sets. A risk score of 5 points was the optimal cut-off value for predicting NC-ESD-RG, and the overall accuracy was 72.7%. As the total risk score increased, the predicted risk of NC-ESD-RG increased from 3.8% to 72.6%. Conclusions: We developed and validated a risk-scoring system for predicting NC-ESD-RG based on pre-ESD variables. Our risk-scoring system can facilitate informed consent and decision-making for preoperative treatment selection between ESD and surgery in patients with EGC.
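
The two performance figures reported here, the area under the ROC curve and the accuracy at a score cut-off, can be computed from integer risk scores as sketched below. The scores and outcomes are hypothetical, and treating a score at or above the cut-off as a positive prediction is an assumption about how the cut-off is applied.

```python
def auc(scores, outcomes):
    """AUC as the probability that a positive case outscores a negative one (ties count half)."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy_at_cutoff(scores, outcomes, cutoff):
    """Share of cases classified correctly when score >= cutoff predicts the event."""
    predictions = [int(s >= cutoff) for s in scores]
    return sum(p == y for p, y in zip(predictions, outcomes)) / len(outcomes)

# Hypothetical total risk scores and outcomes (1 = event, 0 = no event).
scores = [2, 6, 5, 3, 4, 7]
outcomes = [0, 0, 1, 1, 0, 1]

validation_auc = auc(scores, outcomes)
validation_accuracy = accuracy_at_cutoff(scores, outcomes, cutoff=5)
```

Computing the same two numbers separately on the training and validation sets is what allows a statement such as "0.672 in both sets" to be checked.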

External Validation of a Clinical Scoring System for Hematuria

  • Lee, Seung Bae;Kim, Hyung Suk;Kim, Myong;Ku, Ja Hyeon
    • Asian Pacific Journal of Cancer Prevention / Vol. 15, No. 16 / pp.6819-6822 / 2014
  • Background: The aim of this study was to evaluate the accuracy of a new scoring system in Korean patients with hematuria at high risk of bladder cancer. Materials and Methods: A total of 319 consecutive patients presenting with painless hematuria and no history of bladder cancer between August 2012 and February 2014 were analyzed. All patients underwent clinical examination, and 22 patients with incomplete data were excluded from the final validation data set. The scoring system included four clinical parameters: age (≥50 = 2 vs. <50 = 1), gender (male = 2 vs. female = 1), history of smoking (smoker/ex-smoker = 4 vs. non-smoker = 2) and nature of the hematuria (gross = 6 vs. microscopic = 2). Results: The area under the receiver operating characteristic curve (95% confidence interval) of the scoring system was 0.718 (0.655-0.777). The calibration plot demonstrated a slight underestimation of bladder cancer probability, but the model had reasonable calibration. Decision curve analysis revealed that use of the model was associated with net benefit gains over the treat-all strategy. The scoring system performed well across a wide range of threshold probabilities (15%-45%). Conclusions: The scoring system developed is a highly accurate predictive tool for patients with hematuria. Although further improvements are needed, use of this system may assist primary care physicians and other healthcare practitioners in determining a patient's risk of bladder cancer.
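
The point values listed in this abstract translate directly into a small scoring function. The function name and argument names are hypothetical; the abstract does not state a decision threshold, so none is applied here.

```python
def hematuria_score(age, sex, smoker_or_ex_smoker, gross_hematuria):
    """Sum the four clinical parameters using the point values given in the abstract."""
    score = 2 if age >= 50 else 1               # age: >=50 = 2, <50 = 1
    score += 2 if sex == "male" else 1          # gender: male = 2, female = 1
    score += 4 if smoker_or_ex_smoker else 2    # smoking history: yes = 4, no = 2
    score += 6 if gross_hematuria else 2        # hematuria: gross = 6, microscopic = 2
    return score

highest = hematuria_score(62, "male", True, True)
lowest = hematuria_score(35, "female", False, False)
```

With these weights the total ranges from 6 to 14, and the nature of the hematuria contributes the largest difference between categories.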

A Study on Forecasting Accuracy Improvement of Case Based Reasoning Approach Using Fuzzy Relation (퍼지 관계를 활용한 사례기반추론 예측 정확성 향상에 관한 연구)

  • Lee, In-Ho;Shin, Kyung-Shik
    • Journal of Intelligence and Information Systems / Vol. 16, No. 4 / pp.67-84 / 2010
  • In business, forecasting is the work of estimating what is expected to happen in the future in order to make managerial decisions and plans. Accurate forecasting is therefore very important for major managerial decision making and is the basis for various business strategies. However, it is very difficult to make unbiased and consistent estimates because of the uncertainty and complexity of the future business environment. That is why we should use scientific forecasting models to support business decision making and try to minimize the model's forecasting error, which is the difference between the observation and the estimate. Nevertheless, minimizing the error is not an easy task. Case-based reasoning is a problem-solving method that uses similar past cases to solve the current problem. To build successful case-based reasoning models, it is very important to retrieve not only the most similar case but also the most relevant one. Measuring the similarity between cases is therefore a key factor in retrieving similar and relevant cases, and measuring distances is more difficult when the cases contain symbolic data. The purpose of this study is to improve the forecasting accuracy of the case-based reasoning approach using fuzzy relation and composition. Two methods are adopted to measure the similarity between cases containing symbolic data: one derives the similarity matrix following binary logic (judging whether two symbolic values are the same), and the other derives the similarity matrix following fuzzy relation and composition. The study proceeds in the following order: data gathering and preprocessing, model building and analysis, validation analysis, and conclusion. First, in the data gathering and preprocessing stage, we collect a data set that includes categorical dependent variables. The data set is cross-sectional, and its independent variables include several qualitative variables expressed as symbolic data. The research data consist of many financial ratios and the corresponding bond ratings of Korean companies. The ratings we employ cover all bonds rated by one of the bond rating agencies in Korea. Our total sample includes 1,816 companies whose commercial papers were rated in the period 1997-2000. Credit grades are defined as outputs and classified into five rating categories (A1, A2, A3, B, C) according to credit level. Second, in the model building and analysis stage, we derive the similarity matrix following binary logic and fuzzy composition to measure the similarity between cases containing symbolic data. The types of fuzzy composition used are max-min, max-product, and max-average. The analysis is then carried out by the case-based reasoning approach with the derived similarity matrix. Third, in the validation analysis stage, we verify the validity of the model through the McNemar test based on the hit ratio. Finally, we draw a conclusion from the study. As a result, the similarity measurement method using fuzzy relation and composition shows better forecasting performance than the method using binary logic for measuring similarity between two symbolic data. However, the differences in forecasting performance among the types of fuzzy composition are not statistically significant. The contributions of this study are as follows. We propose a methodology in which fuzzy relation and fuzzy composition are applied to the similarity measurement between two symbolic data, which is the most important factor in building a case-based reasoning model.
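
The three composition types named in this abstract follow the usual textbook definitions, in which each entry of the composed relation is the maximum over the intermediate elements of the minimum, the product, or the average of the paired membership degrees. The sketch below uses those standard definitions; the two relation matrices are hypothetical and do not come from the study's bond rating data.

```python
def compose(r, s, combine):
    """Generic max-* composition: entry (i, j) is max over k of combine(r[i][k], s[k][j])."""
    inner, cols = len(s), len(s[0])
    return [
        [max(combine(row[k], s[k][j]) for k in range(inner)) for j in range(cols)]
        for row in r
    ]

def max_min(r, s):
    return compose(r, s, min)

def max_product(r, s):
    return compose(r, s, lambda a, b: a * b)

def max_average(r, s):
    return compose(r, s, lambda a, b: (a + b) / 2)

# Hypothetical fuzzy relations with membership degrees in [0, 1].
R = [[0.2, 0.8], [0.6, 0.4]]
S = [[0.5, 0.9], [0.7, 0.3]]

mm = max_min(R, S)
mp = max_product(R, S)
ma = max_average(R, S)
```

Because only the inner combining operation changes, the three variants can be swapped freely when deriving a similarity matrix, which makes it easy to compare them the way the study does.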

Accuracy of genotype imputation based on reference population size and marker density in Hanwoo cattle

  • Lee, DooHo;Kim, Yeongkuk;Chung, Yoonji;Lee, Dongjae;Seo, Dongwon;Choi, Tae Jeong;Lim, Dajeong;Yoon, Duhak;Lee, Seung Hwan
    • Journal of Animal Science and Technology / Vol. 63, No. 6 / pp.1232-1246 / 2021
  • The cattle genome sequence has recently been completed, followed by the development of commercial single nucleotide polymorphism (SNP) chip panels in the animal genome industry. To increase the statistical power for detecting quantitative trait loci (QTL), a large number of animals should be genotyped. However, using a high-density chip for many animals would increase the genotyping cost. Therefore, statistical inference by genotype imputation (from a low-density chip to high density) will be useful in the animal industry. The purpose of this study is to investigate the effect of reference population size and marker density on imputation accuracy and to suggest an appropriate reference population size for imputation in Hanwoo cattle. A total of 3,821 Hanwoo cattle were divided into reference and validation populations. The reference sets consisted of 50k (38,916) marker data and different population sizes (500, 1,000, 1,500, 2,000, and 3,600). The validation sets consisted of four validation sets (889 in total) and different marker densities (5k [5,000], 10k [10,000], and 15k [15,000]). The accuracy of imputation was calculated by direct comparison of the true genotype and the imputed genotype. When the lowest marker density (5k) was used in the validation set, the imputation accuracy ranged from 0.793 to 0.929 depending on the reference population size. When the highest marker density (15k) was used, the imputation accuracy ranged from 0.904 to 0.967 depending on the reference population size. Moreover, the reference population size should be more than 1,000 to obtain at least 88% imputation accuracy in Hanwoo cattle.
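
One simple reading of "direct comparison of the true genotype and the imputed genotype" is a concordance rate, the share of genotype calls that match. The sketch below assumes that reading and a 0/1/2 allele-count coding; both are assumptions, and the genotype values are hypothetical.

```python
def imputation_accuracy(true_genotypes, imputed_genotypes):
    """Fraction of genotype calls (animals x markers) where imputed equals true."""
    matches = total = 0
    for true_row, imputed_row in zip(true_genotypes, imputed_genotypes):
        for t, i in zip(true_row, imputed_row):
            total += 1
            matches += t == i
    return matches / total

# Hypothetical genotypes for two animals at four markers (0, 1, or 2 copies of an allele).
true_calls = [[0, 1, 2, 1], [2, 2, 0, 1]]
imputed_calls = [[0, 1, 1, 1], [2, 2, 0, 0]]

accuracy = imputation_accuracy(true_calls, imputed_calls)
```

Repeating this calculation for each combination of reference population size and marker density would give a grid of accuracies like the ranges reported in the abstract.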

An Effective Clustering Procedure for Quantitative Data and Its Application for the Grouping of the Reusable Nuclear Fuel (정량적 자료에 대한 효과적인 군집화 과정 및 사용 후 핵연료의 분류에의 적용)

  • Jing, Jin-Xi;Yoon, Bok-Sik;Lee, Yong-Joo
    • IE interfaces / Vol. 15, No. 2 / pp.182-188 / 2002
  • Clustering is widely used in various fields in order to investigate structural characteristics of the given data. One of the main tasks of clustering is to partition a set of objects into homogeneous groups for the purpose of data reduction. In this paper a simple but computationally efficient clustering procedure is devised and some statistical techniques to validate its clustered results are discussed. In the given procedure, the proper number of clusters and the clustered groups can be determined simultaneously. The whole procedure is applied to a practical clustering problem for the classification of reusable fuels in nuclear power plants.

Sequence driven features for prediction of subcellular localization of proteins

  • Kim, Jong-Kyoung;Bang, Sung-Yang;Choi, Seung-Jin
    • Proceedings of the Korean Society for Bioinformatics Conference / Korean Society for Bioinformatics and Systems Biology, BIOINFO 2005 / pp.237-242 / 2005
  • Predicting the cellular location of an unknown protein gives valuable information for inferring the possible function of the protein. For a more accurate prediction system, we need a good feature extraction method that transforms the raw sequence data into a numerical feature vector while minimizing information loss. In this paper, we propose new methods of extracting underlying features from the sequence data alone by computing pairwise sequence alignment scores. In addition, we use composition-based features to improve prediction accuracy. To construct an SVM ensemble from separately trained SVM classifiers, we propose specificity-based weighted majority voting. The overall prediction accuracy evaluated by 5-fold cross-validation reached 88.53% for the eukaryotic animal data set. By comparing the prediction accuracy of various feature extraction methods, we gained biological insight into the location of targeting information. Our numerical experiments confirm that the new feature extraction methods are very useful for predicting subcellular localization of proteins.
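
Weighted majority voting over separately trained classifiers can be sketched as follows. The abstract does not give the weighting formula, so this sketch assumes each classifier's vote for a class is weighted by that classifier's specificity for the class, estimated on held-out data. The classifier names, classes, and weights are hypothetical.

```python
def weighted_vote(predictions, weights):
    """Return the class with the largest summed weight across classifier votes.

    predictions: classifier name -> predicted class
    weights: classifier name -> {class -> weight assumed to be a specificity}
    """
    totals = {}
    for classifier, label in predictions.items():
        totals[label] = totals.get(label, 0.0) + weights[classifier][label]
    return max(totals, key=totals.get)

# Hypothetical votes from three classifiers trained on different feature sets.
predictions = {
    "alignment_svm": "nucleus",
    "composition_svm": "cytoplasm",
    "dipeptide_svm": "cytoplasm",
}
weights = {
    "alignment_svm": {"nucleus": 0.95, "cytoplasm": 0.90},
    "composition_svm": {"nucleus": 0.60, "cytoplasm": 0.40},
    "dipeptide_svm": {"nucleus": 0.55, "cytoplasm": 0.50},
}

ensemble_label = weighted_vote(predictions, weights)
```

In this example the single high-weight vote outweighs two low-weight votes, which is the behavior that distinguishes weighted voting from a simple majority.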
