http://dx.doi.org/10.5762/KAIS.2019.20.2.200

Predictive Optimization Adjusted With Pseudo Data From A Missing Data Imputation Technique  

Kim, Jeong-Woo (Asan Institute for Life Sciences)
Publication Information
Journal of the Korea Academia-Industrial cooperation Society / v.20, no.2, 2019, pp. 200-209
Abstract
When forecasting future values, a model estimated by minimizing training errors can yield test errors higher than its training errors. This is the overfitting problem, which arises when model complexity increases because the model is fitted only to the given dataset. Regularization and resampling methods have been introduced to reduce test errors by alleviating this problem, but they are designed to work with the given dataset alone. In this paper, we propose a new optimization approach that reduces test errors by transforming the test error minimization problem into a training error minimization problem. This transformation requires data in addition to the given dataset, which we term pseudo data. To generate the pseudo data, we used three types of missing data imputation techniques. As the optimization tool, we chose the least squares method and combined it with an extra pseudo data instance. We present numerical results supporting the proposed approach, which yielded lower test errors than the ordinary least squares method.
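The abstract does not say which imputation techniques were used or exactly how the pseudo instance enters the fit, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's procedure. It treats the unknown response at a forecast input as a missing value, imputes it by averaging the responses of the nearest training inputs (an assumed stand-in for the imputation step), appends the result to the training set as one pseudo data instance, and refits least squares. The function names, the nearest-neighbour imputation, and the synthetic data are all hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Evaluate the fitted linear model on the rows of X."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef


def impute_response(X, y, x_new, k=3):
    """Assumed imputation rule: mean response of the k nearest training inputs."""
    dist = np.linalg.norm(X - x_new, axis=1)
    nearest = np.argsort(dist)[:k]
    return y[nearest].mean()


def fit_with_pseudo_instance(X, y, x_new, k=3):
    """Append one pseudo data instance (x_new, imputed y) and refit least squares."""
    y_pseudo = impute_response(X, y, x_new, k)
    X_aug = np.vstack([X, x_new])
    y_aug = np.append(y, y_pseudo)
    return fit_ols(X_aug, y_aug), y_pseudo


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=20)
x_new = np.array([0.3, -0.2])  # forecast input whose response is unknown

coef_ols = fit_ols(X, y)
coef_aug, y_pseudo = fit_with_pseudo_instance(X, y, x_new)

print("OLS forecast:         ", predict(coef_ols, x_new[None, :])[0])
print("Pseudo-data forecast: ", predict(coef_aug, x_new[None, :])[0])
```

Whether the augmented fit actually lowers test error depends on the quality of the imputed value, so a comparison on held-out data would be needed before drawing any conclusion from a sketch like this.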
Keywords
bias and variance; prediction; missing data imputation; overfitting; pseudo data; test error and training error