• Title/Abstract/Keywords: Validation data set

378 search results; processing time 0.027 s

Fixed size LS-SVM for multiclassification problems of large data sets

  • Hwang, Hyung-Tae
    • Journal of the Korean Data and Information Science Society
    • /
    • Vol. 21, No. 3
    • /
    • pp.561-567
    • /
    • 2010
  • Multiclassification is typically performed using voting schemes that combine a set of binary classifications. In this paper we use a multiclassification method based on the hat matrix of the least squares support vector machine (LS-SVM), which can be regarded as a revised one-against-all method. To tackle multiclass problems for large data sets, we use the Nyström approximation and the quadratic Renyi entropy, with estimation in the primal space as used in fixed size LS-SVM. Generalized cross validation techniques are employed to select the hyperparameters. Experimental results are then presented to indicate the performance of the proposed procedure.

A Study on the Prediction of Community Smart Pension Intention Based on Decision Tree Algorithm

  • Liu, Lijuan;Min, Byung-Won
    • International Journal of Contents
    • /
    • Vol. 17, No. 4
    • /
    • pp.79-90
    • /
    • 2021
  • With the deepening of population aging, pension has become an urgent problem in most countries. Community smart pension can effectively resolve the problems of traditional pension and meet the personalized, multi-level needs of the elderly. To predict the pension intention of the elderly in the community more accurately, this paper uses the decision tree classification method to classify the pension data. After missing value processing, normalization, discretization and data specification, the discretized sample data set is obtained. Then, by comparing the information gain and information gain rate of the sample data features, the feature ranking is determined, and the C4.5 decision tree model is established. Under 10-fold cross-validation the model performs well in accuracy, precision, recall, AUC and other indicators, with a precision of 89.5%, which can provide a basis for government decision-making.
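The feature-ranking step described in this abstract can be illustrated with a minimal sketch of information gain and gain ratio (the "information gain rate" used by C4.5). The feature names and toy records below are hypothetical and are not taken from the paper's data set.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy reduction obtained by splitting the labels on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(feature, labels):
    """Information gain divided by the split information (C4.5 criterion)."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: two discretized features and a binary intention label.
health = ["good", "good", "poor", "poor", "good", "poor"]
income = ["low", "high", "low", "high", "low", "high"]
intent = ["no", "no", "yes", "yes", "no", "yes"]

# Rank features from most to least informative by gain ratio.
scores = {"health": gain_ratio(health, intent), "income": gain_ratio(income, intent)}
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In this toy example `health` separates the two classes perfectly, so it ranks first; a C4.5-style tree would split on it at the root.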

Nonparametric homogeneity tests of two distributions for credit rating model validation

  • 홍종선;김지훈
    • Journal of the Korean Data and Information Science Society
    • /
    • Vol. 20, No. 2
    • /
    • pp.261-272
    • /
    • 2009
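  • In credit rating models, the nonparametric Kolmogorov-Smirnov (K-S) test for the homogeneity of two distribution functions is the representative method for testing discriminatory power between two groups. In addition to the K-S test, this study introduces the Cramer-von Mises, Anderson-Darling, and Watson tests for the homogeneity of two distribution functions in credit rating models, and proposes decision criteria corresponding to those of Joseph (2005). Through simulations under conditions similar to credit rating data, it also presents alternative decision criteria that take into account the default rate, the sample size, and the type II error rate, and examines how to apply them.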
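As an illustration of the K-S test named in this abstract, the sketch below computes the two-sample K-S statistic, that is, the largest vertical gap between the empirical distribution functions of two score samples. The scores are hypothetical, and the decision thresholds discussed in the paper are not reproduced here.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: maximum gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d_max = 0.0
    # Evaluate both empirical CDFs at every observed value.
    for x in sorted(set(a) | set(b)):
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d_max = max(d_max, abs(i / len(a) - j / len(b)))
    return d_max

# Hypothetical credit scores for non-defaulting and defaulting borrowers.
good_scores = [620, 650, 700, 710, 730, 760]
bad_scores = [480, 520, 560, 610, 640, 690]
d = ks_statistic(good_scores, bad_scores)
```

A value near 1 means the score separates the two groups almost completely, while a value near 0 means the two distributions are nearly identical.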

Development of the Algorithm for Optimizing Wavelength Selection in Multiple Linear Regression

  • Hoeil Chung
    • Near Infrared Analysis
    • /
    • Vol. 1, No. 1
    • /
    • pp.1-7
    • /
    • 2000
  • A convenient algorithm for optimizing wavelength selection in multiple linear regression (MLR) has been developed. MOP (MLR Optimization Program) tests all possible MLR calibration models in a given spectral range and finds an optimal MLR model with external validation capability. MOP generates calibration models from all possible combinations of wavelengths and simultaneously calculates SEC (Standard Error of Calibration) and SEV (Standard Error of Validation) by predicting samples in a validation data set. With SEC and SEV determined, it then calculates another parameter called SAD (the sum of SEC, SEV, and the absolute difference between SEC and SEV: SAD = SEC + SEV + |SEC - SEV|). SAD is a useful parameter for finding an optimal calibration model without over-fitting, because it simultaneously evaluates SEC, SEV, and the difference in error between calibration and validation. The calibration model with the smallest SAD value is chosen as the optimum because its calibration and validation errors are both minimal and similar in scale. To evaluate the capability of MOP, the determination of benzene content in unleaded gasoline was examined. MOP successfully found the optimal calibration model and showed better calibration and independent prediction performance than conventional MLR calibration.
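The SAD criterion described in this abstract can be sketched as follows. The MLR fitting step is omitted; the candidate wavelength combinations and their SEC and SEV values are hypothetical placeholders, not results from the paper.

```python
def sad(sec, sev):
    """SAD = SEC + SEV + |SEC - SEV|, following the definition in the abstract."""
    return sec + sev + abs(sec - sev)

# Hypothetical (SEC, SEV) pairs keyed by wavelength combination in nm.
candidates = {
    (1140, 1670): (0.12, 0.30),  # small SEC but large SEV: likely over-fitted
    (1140, 2140): (0.18, 0.20),
    (1670, 2140): (0.25, 0.26),
}

# Choose the calibration model with the smallest SAD.
best = min(candidates, key=lambda combo: sad(*candidates[combo]))
```

Algebraically, SEC + SEV + |SEC - SEV| equals twice the larger of the two errors, so minimizing SAD penalizes a model whose validation error is much worse than its calibration error, which is the over-fitting case the abstract describes.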

On validation of fully coupled behavior of porous media using centrifuge test results

  • Tasiopoulou, Panagiota;Taiebat, Mahdi;Tafazzoli, Nima;Jeremic, Boris
    • Coupled systems mechanics
    • /
    • Vol. 4, No. 1
    • /
    • pp.37-65
    • /
    • 2015
  • Modeling and simulation of the mechanical response of infrastructure objects, solids and structures, relies on computational models to foretell the state of a physical system under conditions for which the computational model has not been validated. Verification and Validation (V&V) procedures are the primary means of assessing accuracy and building confidence and credibility in modeling and computational simulation of the behavior of those infrastructure objects. Validation is the process of determining the degree to which a model is an accurate representation of the real world from the perspective of the intended uses of the model. It is mainly a physics issue and provides evidence that the correct model is solved (Oberkampf et al. 2002). Our primary interest is in modeling and simulating the behavior of porous particulate media that are fully saturated with pore fluid, including cyclic mobility and liquefaction. Fully saturated soils undergoing dynamic shaking fall into this category. Verification of modeling and simulation of fully saturated porous soils is addressed in more detail by Tasiopoulou et al. (2014); in this paper we address validation. A set of centrifuge experiments is used for this purpose. The effects of scaling laws on centrifuge experiments, and their influence on validation, are discussed. Available validation tests are reviewed in view of first and second order phenomena and their importance to validation. For example, the dynamic behavior of the system, which follows the dynamic time, and the dissipation of pore fluid pressures, which follows the diffusion time, do not occur on the same time scale, and those discrepancies are discussed. Laboratory tests performed on the soil used in the centrifuge experiments were used to calibrate material models that are then used in the validation process. A number of physical and numerical examples are used for validation and to illustrate the discussion. In particular, it is shown that, for the most part, numerical predictions of behavior made prior to the centrifuge experiments, using laboratory test data to calibrate the soil material model, can be validated using scaled tests. There are, of course, discrepancies, the sources of which are analyzed and discussed.

Design of weighted federated learning framework based on local model validation

  • Kim, Jung-Jun;Kang, Jeon Seong;Chung, Hyun-Joon;Park, Byung-Hoon
    • Journal of the Korea Society of Computer and Information (한국컴퓨터정보학회논문지)
    • /
    • Vol. 27, No. 11
    • /
    • pp.13-18
    • /
    • 2022
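  • This paper proposes two variants of VW-FedAVG (Validation based Weighted FedAVG), which updates the global model by weighting the models of the devices participating in training according to their validation performance. The first is a server-side validation structure, designed so that each local client model is validated on a single, complete validation dataset before the global model is updated. The second is a client-side validation structure, designed so that the validation dataset is distributed evenly across the clients, which perform validation before the global model is updated. The datasets used in all experiments were MNIST and CIFAR-10; for image classification, the method achieved higher accuracy than previous work under both IID and Non-IID distributions.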
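The idea of weighting client models by validation performance can be sketched as below. The abstract does not give the exact weighting formula, so this sketch assumes weights proportional to each client's validation accuracy; the function name, parameter vectors, and accuracies are hypothetical.

```python
def validation_weighted_average(client_params, val_scores):
    """Aggregate client parameter vectors, weighting each by its validation score.

    Assumption: weights are the validation scores normalized to sum to 1.
    """
    total = sum(val_scores)
    if total <= 0:
        raise ValueError("at least one validation score must be positive")
    coeffs = [s / total for s in val_scores]
    n_params = len(client_params[0])
    return [
        sum(c * params[k] for c, params in zip(coeffs, client_params))
        for k in range(n_params)
    ]

# Hypothetical flattened parameters from three clients and their validation accuracies.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
accuracies = [0.9, 0.6, 0.3]
global_model = validation_weighted_average(clients, accuracies)
```

With equal validation scores this reduces to a plain unweighted mean of the client parameters; standard FedAvg instead weights clients by their local sample counts.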

Development of kNN QSAR Models for 3-Arylisoquinoline Antitumor Agents

  • Tropsha, Alexander;Golbraikh, Alexander;Cho, Won-Jea
    • Bulletin of the Korean Chemical Society
    • /
    • Vol. 32, No. 7
    • /
    • pp.2397-2404
    • /
    • 2011
  • A variable selection k nearest neighbor (kNN) QSAR modeling approach was applied to a data set of 80 3-arylisoquinolines exhibiting cytotoxicity against a human lung tumor cell line (A-549). All compounds were characterized with molecular topology descriptors calculated with the MolconnZ program. Seven compounds were randomly selected from the original data set and used as an external validation set. The remaining subset of 73 compounds was divided into multiple training (56 to 61 compounds) and test (17 to 12 compounds) sets using a chemical diversity sampling method developed in this group. Highly predictive models were obtained, characterized by leave-one-out cross-validated $R^2$ ($q^2$) values greater than 0.8 for the training sets and $R^2$ values greater than 0.7 for the test sets. The robustness of the models was confirmed by the Y-randomization test: all models built using training sets with randomly shuffled activities were characterized by low $q^2 \leq 0.26$ and $R^2 \leq 0.22$ for training and test sets, respectively. The twelve best models (with the highest values of both $q^2$ and $R^2$) predicted the activities of the external validation set of seven compounds with $R^2$ ranging from 0.71 to 0.93.
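The leave-one-out cross-validated $q^2$ mentioned in this abstract can be sketched for a plain kNN regressor as follows. This is a simplified illustration: it uses an unweighted mean of the k nearest neighbors and no variable selection, so it is not the authors' procedure, and the descriptor values and activities are hypothetical.

```python
def knn_predict(train_x, train_y, x, k):
    """Predict the activity of x as the mean activity of its k nearest neighbors."""
    by_distance = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    nearest = by_distance[:k]
    return sum(train_y[i] for i in nearest) / k

def loo_q2(xs, ys, k):
    """Leave-one-out q^2 = 1 - PRESS / SS, where SS is taken about the mean activity."""
    mean_y = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        # Hold out compound i and predict it from all the others.
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        press += (ys[i] - knn_predict(rest_x, rest_y, xs[i], k)) ** 2
    ss = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss

# Hypothetical one-descriptor data set with a linear activity trend.
descriptors = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
activities = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
q2 = loo_q2(descriptors, activities, k=2)
```

A Y-randomization check would repeat the same calculation after shuffling `activities`; a robust model should then give a much lower $q^2$.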

Cross-Validation Probabilistic Neural Network Based Face Identification

  • Lotfi, Abdelhadi;Benyettou, Abdelkader
    • Journal of Information Processing Systems
    • /
    • Vol. 14, No. 5
    • /
    • pp.1075-1086
    • /
    • 2018
  • This paper presents a cross-validation algorithm for training probabilistic neural networks (PNNs), to be applied to automatic face identification. Standard PNNs perform well on small and medium-sized databases but suffer from serious problems on large databases such as those encountered in biometric applications. To address this issue, we propose a new training algorithm for PNNs that reduces the size of the hidden layer and avoids over-fitting at the same time. The proposed training algorithm generates networks with a smaller hidden layer that contains only representative examples from the training data set. Moreover, adding new classes or samples after training does not require retraining, which is one of the main characteristics of this solution. The results presented in this work show a great improvement in both the processing speed and the generalization of the proposed classifier. This improvement is mainly due to the significant reduction in the size of the hidden layer.

A Decision Tree Induction using Genetic Programming with Sequentially Selected Features

  • 김효중;박종선
    • Korean Management Science Review (경영과학)
    • /
    • Vol. 23, No. 1
    • /
    • pp.63-74
    • /
    • 2006
  • Decision tree induction is one of the most widely used methods for classification problems. However, a tree algorithm that uses top-down search can become trapped in a local minimum and has no reasonable means of escaping from it. Further, if irrelevant or redundant features are included in the data set, tree algorithms produce trees that are less accurate than those built from a data set with only relevant features. We propose a hybrid algorithm for generating decision trees that uses genetic programming with sequentially selected features. The Correlation-based Feature Selection (CFS) method is adopted to find relevant features, which are fed to genetic programming sequentially to find optimal trees at each iteration. The proposed algorithm produces simpler and more understandable decision trees than other decision tree methods, and it is also effective in producing trees of similar or better cross-validation accuracy with a relatively smaller set of features.

A Study on SIARD Verification as a Preservation Format for Data Set Records

  • 윤성호;이정은;양동민
    • Journal of Korean Society of Archives and Records Management (한국기록관리학회지)
    • /
    • Vol. 21, No. 3
    • /
    • pp.99-118
    • /
    • 2021
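  • With the advent of the Fourth Industrial Revolution and the growing importance of data, countries abroad are pursuing research on long-term data preservation technologies. In Korea, by contrast, administrative information data sets have been brought into the scope of records management by law, but no concrete long-term preservation plan exists. This study therefore conducted basic and cross-verification tests on SIARD (Software Independent Archiving of Relational Databases), which several previous studies have proposed as a preservation format for administrative information data sets. The basic verification test focused on identifying the data, structures, and functions of a data set that the SIARD format can preserve. The cross-verification test aimed to verify the interoperability of SIARD regardless of the type of DBMS. The two verification tests confirmed that the SIARD format cannot preserve the JSON and UROWID data types, FKs (foreign keys), or function-type elements, and that the functions specified in the SIARD 2.0 standard differ from those actually provided by SIARD Suite. The significance of this study is that it carried out empirical verification tests and proposed both a development plan to supplement the functions of SIARD Suite and a direction for developing SIARD Suite efficiently for the domestic environment.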