• Title/Summary/Keyword: Generalized cross-validation

Minimum Message Length and Classical Methods for Model Selection in Univariate Polynomial Regression

  • Viswanathan, Murlikrishna;Yang, Young-Kyu;WhangBo, Taeg-Keun
    • ETRI Journal / v.27 no.6 / pp.747-758 / 2005
  • The problem of selecting among competing models has been a fundamental issue in statistical data analysis. Good fits to data can be misleading, since they can result from properties of the model that have nothing to do with its being a close approximation to the source distribution of interest (for example, overfitting). In this study we focus on the preference among models from a family of polynomial regressors. Three decades of research have spawned a number of plausible techniques for the selection of models, namely, Akaike's Final Prediction Error (FPE) and Information Criterion (AIC), Schwarz's criterion (SCH), Generalized Cross Validation (GCV), Wallace's Minimum Message Length (MML), Minimum Description Length (MDL), and Vapnik's Structural Risk Minimization (SRM). The fundamental similarity among all these principles is their attempt to define an appropriate balance between the complexity of models and their ability to explain the data. This paper presents an empirical study of the above principles in the context of model selection, where the models under consideration are univariate polynomials. The paper includes a detailed empirical evaluation of the model selection methods on six target functions, with varying sample sizes and added Gaussian noise. The results from the study appear to provide strong evidence in support of the MML- and SRM-based methods over the other standard approaches (FPE, AIC, SCH, and GCV).
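
As an illustration of one of the criteria named in this abstract, the following is a minimal sketch of GCV-based degree selection for univariate polynomial regression. It assumes the common form GCV(d) = (RSS/n) / (1 - p/n)^2 with p = d + 1 fitted coefficients; the function names, the synthetic quadratic target, and the noise level are hypothetical choices for this example and are not taken from the paper.

```python
import numpy as np


def gcv_score(x, y, degree):
    """GCV score of a least-squares polynomial fit (assumed form, see lead-in)."""
    n = len(x)
    p = degree + 1  # number of fitted coefficients; must stay below n
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    mean_sq_residual = np.mean(residuals ** 2)
    # Inflate the training error by a penalty that grows with model size.
    return mean_sq_residual / (1.0 - p / n) ** 2


def select_degree(x, y, max_degree):
    """Return the degree with the smallest GCV score, plus all scores."""
    scores = {d: gcv_score(x, y, d) for d in range(max_degree + 1)}
    best_degree = min(scores, key=scores.get)
    return best_degree, scores


# Hypothetical data: a quadratic target with added Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 60)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + rng.normal(scale=0.1, size=x.size)

best_degree, scores = select_degree(x, y, max_degree=8)
print("selected degree:", best_degree)
```

The penalty term depends only on the number of coefficients, so the same loop can be reused with a different criterion by swapping the return expression of `gcv_score`.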

Acoustic holography for an engine radiation noise using equivalent sources (등가음원을 이용한 엔진 방사 소음의 음향 홀로그래피에 대한 연구)

  • Jeon, In-Youl;Ih, Jeong-Guon
    • Proceedings of the Korean Society for Noise and Vibration Engineering Conference / 2004.11a / pp.1101-1106 / 2004
  • This study presents the reconstruction of the sound field radiated from an automotive engine using equivalent sources. The basic concept of the method is to replace the engine noise source with elementary multipole sources, e.g., monopoles and dipoles. The so-called Helmholtz equation least-squares (HELS) method can reconstruct the sound radiation fields from spherical geometries as a series expansion of spherical Hankel functions and spherical harmonics. In this paper, multi-point, multipole equivalent sources are employed to reconstruct the sound field radiated from an automotive engine running at a fixed rotation speed. To ensure and improve the accuracy of the reconstruction, spatial filters on the multipole coefficients and wave vectors are adopted to suppress the adverse effect of high-order multipoles. Optimal filter shapes are designed with regularization parameters that minimize the generalized cross validation (GCV) function between the actual and reproduced model. After the field pressures are regenerated with the proposed method as many times as necessary, the vibro-acoustic field of an engine can be reconstructed cost-effectively using the BEM-based near-field acoustic holography (NAH) technique.
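
The idea of choosing a regularization parameter by minimizing a GCV function can be sketched generically as follows. This is not the authors' spatial filter design: it assumes plain Tikhonov regularization of a linear system A x = b, with filter factors f_i = s_i^2 / (s_i^2 + lam^2) computed from the SVD and GCV(lam) = m * ||A x_lam - b||^2 / (m - sum f_i)^2. The matrix, noise level, and function names are hypothetical.

```python
import numpy as np


def tikhonov_solution(A, b, lam):
    """Tikhonov-regularized least-squares solution via the SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # f_i / s_i written as s_i / (s_i^2 + lam^2) to avoid dividing by tiny s_i.
    return Vt.T @ (s / (s ** 2 + lam ** 2) * (U.T @ b))


def gcv_curve(A, b, lambdas):
    """GCV score for each candidate regularization parameter."""
    m = A.shape[0]
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    beta = U.T @ b
    # Energy of b outside the column space of A; it does not depend on lam.
    out_of_range = max(np.sum(b ** 2) - np.sum(beta ** 2), 0.0)
    scores = []
    for lam in lambdas:
        f = s ** 2 / (s ** 2 + lam ** 2)  # Tikhonov filter factors
        resid_sq = np.sum(((1.0 - f) * beta) ** 2) + out_of_range
        scores.append(m * resid_sq / (m - np.sum(f)) ** 2)
    return np.array(scores)


# Hypothetical ill-conditioned problem: a Vandermonde matrix with noisy data.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 40)
A = np.vander(t, 12, increasing=True)
x_true = rng.normal(size=12)
b = A @ x_true + rng.normal(scale=1e-2, size=t.size)

lambdas = np.logspace(-8, 0, 50)
scores = gcv_curve(A, b, lambdas)
lam_best = lambdas[np.argmin(scores)]
x_best = tikhonov_solution(A, b, lam_best)
print("GCV-selected lambda:", lam_best)
```

Computing the SVD once and reusing it for every candidate value keeps the parameter sweep cheap, which matters when the curve is evaluated on a fine grid.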

Machine learning approaches for wind speed forecasting using long-term monitoring data: a comparative study

  • Ye, X.W.;Ding, Y.;Wan, H.P.
    • Smart Structures and Systems / v.24 no.6 / pp.733-744 / 2019
  • Wind speed forecasting is critical for a variety of engineering tasks, such as wind energy harvesting, scheduling of a wind power system, and dynamic control of structures (e.g., wind turbines, bridges, and buildings). Wind speed is random, nonlinear, and uncertain, which makes it difficult to forecast. Machine learning approaches such as the generalized regression neural network (GRNN), back propagation neural network (BPNN), and extreme learning machine (ELM) are now widely used for wind speed forecasting. In this study, two schemes are proposed to improve the forecasting performance of machine learning approaches. The first uses optimization algorithms, i.e., cross validation (CV), genetic algorithm (GA), and particle swarm optimization (PSO), to automatically find the optimal model parameters. The second combines different machine learning methods through the finite mixture (FM) method. Specifically, CV-GRNN, GA-BPNN, and PSO-ELM are optimization algorithm-assisted machine learning approaches, and FM is a hybrid machine learning approach consisting of GRNN, BPNN, and ELM. The effectiveness of these machine learning methods in wind speed forecasting is investigated using one year of field monitoring data, and their performance is compared.

Functional clustering for electricity demand data: A case study (시간단위 전력수요자료의 함수적 군집분석: 사례연구)

  • Yoon, Sanghoo;Choi, Youngjean
    • Journal of the Korean Data and Information Science Society / v.26 no.4 / pp.885-894 / 2015
  • Forecasting electricity demand is necessary for reliable and effective operation of the power system. In this study, we cluster functional data, namely the mean curves of the daily power demand pattern over time. The data were collected between January 1, 2009 and December 31, 2011, and were converted to time series consisting of seasonal and error components through log transformation and trend removal. The functional clustering method of Ma et al. (2006) is applied, and the parameters are estimated using the EM algorithm and generalized cross validation. The number of clusters is determined by classifying days as holidays or weekdays. The mean curves of the daily power demand pattern are described by day type (Monday, weekday (Tuesday to Friday), Saturday, and Sunday or holiday) and by season.

Utilization Evaluation of Numerical Forest Soil Map to Predict the Weather in Upland Crops (밭작물 농업기상을 위한 수치형 산림입지토양도 활용성 평가)

  • Kang, Dayoung;Hwang, Yeongeun;Yoon, Sanghoo
    • Korean Journal of Agricultural and Forest Meteorology / v.23 no.1 / pp.34-45 / 2021
  • Weather is an important factor in the agricultural industry because it affects the price, production, and quality of crops. Upland crops are directly exposed to the natural environment because they are mainly grown in mountainous areas. Therefore, it is necessary to provide accurate weather information for upland crops. This study examined the effectiveness of 12 forest soil factors for interpolating the weather in mountainous areas. Daily temperature and precipitation data collected by the Korea Meteorological Administration between January 2009 and December 2018 were used. The Generalized Additive Model (GAM), Kriging, and Random Forest (RF) were considered as interpolation methods. To evaluate the interpolation performance, automatic weather stations were used as training data and automated synoptic observing systems were used as test data for cross-validation. The forest soil factors were not significant for interpolating the weather in the mountainous areas. GAM with only geographic variables interpolated well in terms of root mean squared error and mean absolute error. The significance of the factors was tested at the 5% significance level in GAM, and the climate zone code (CLZN_CD) and soil water code B (SIBFLR_LAR) were identified as relatively important factors. The results show that CLZN_CD could help to interpolate the daily average and daily minimum temperature for upland crops.

Calpain-10 SNP43 and SNP19 Polymorphisms and Colorectal Cancer: a Matched Case-control Study

  • Hu, Xiao-Qin;Yuan, Ping;Luan, Rong-Sheng;Li, Xiao-Ling;Liu, Wen-Hui;Feng, Fei;Yan, Jin;Yang, Yan-Fang
    • Asian Pacific Journal of Cancer Prevention / v.14 no.11 / pp.6673-6680 / 2013
  • Objective: Insulin resistance (IR) is an established risk factor for colorectal cancer (CRC). Given that CRC and IR physiologically overlap and the calpain-10 gene (CAPN10) is a candidate for IR, we explored the association between CAPN10 and CRC risk. Methods: Blood samples of 400 case-control pairs were genotyped, and the lifestyle and dietary habits of these pairs were recorded. Unconditional logistic regression (LR) was used to assess the effects of CAPN10 SNP43 and SNP19, and of environmental factors. Both generalized multifactor dimensionality reduction (GMDR) and the classification and regression tree (CART) were used to test gene-environment interactions for CRC risk. Results: The GA+AA genotype of SNP43 and the Del/Ins+Ins/Ins genotype of SNP19 were marginally related to CRC risk (GA+AA: OR = 1.35, 95% CI = 0.92-1.99; Del/Ins+Ins/Ins: OR = 1.31, 95% CI = 0.84-2.04). Notably, a high-order interaction was consistently identified by GMDR and CART analyses. In GMDR, the four-factor interaction model of SNP43, SNP19, red meat consumption, and smoked meat consumption was the best model, with a maximum cross-validation consistency of 10/10 and testing balanced accuracy of 0.61 (P < 0.01). In LR, subjects with high red and smoked meat consumption and two risk genotypes had a 6.17-fold CRC risk (95% CI = 2.44-15.6) relative to that of subjects with low red and smoked meat consumption and null risk genotypes. In CART, individuals with high smoked and red meat consumption, SNP19 Del/Ins+Ins/Ins, and SNP43 GA+AA had higher CRC risk (OR = 4.56, 95% CI = 1.94-10.75) than those with low smoked and red meat consumption. Conclusions: Though the single loci of CAPN10 SNP43 and SNP19 are not enough to significantly increase CRC susceptibility, the combination of SNP43, SNP19, red meat consumption, and smoked meat consumption is associated with elevated risk.