http://dx.doi.org/10.5626/JCSE.2016.10.2.51

Influence of Data Preprocessing

Zhu, Changming (College of Information Engineering, Shanghai Maritime University and Department of Computer Science & Engineering, East China University of Science & Technology)
Gao, Daqi (Department of Computer Science & Engineering, East China University of Science & Technology)

Publication Information

Journal of Computing Science and Engineering / v.10, no.2, 2016, pp. 51-57

Abstract

In this paper, we study how data preprocessing influences classification. We conclude that different preprocessing methods lead to different classification performance. Moreover, not every preprocessing method is necessary, and we give a criterion for determining which preprocessing methods are necessary and which are effective. Experiments on several real-world data sets confirm that different preprocessing methods produce different effects. Further experiments that pair several algorithms with different preprocessing methods also confirm that preprocessing strongly influences the performance of a classifier.
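
The claim that preprocessing can change a classifier's output can be illustrated with a minimal sketch. This is not the paper's experiment or its criterion: the toy data, the 1-nearest-neighbor classifier, and the helper names `min_max_scale` and `predict_1nn` are all assumptions chosen for illustration. The second feature has a much larger numeric range than the first, so it dominates the Euclidean distance unless the features are rescaled.

```python
import math

def min_max_scale(rows, reference):
    """Scale each feature of `rows` to [0, 1] using the ranges found in `reference`."""
    columns = list(zip(*reference))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

def predict_1nn(train_x, train_y, query):
    """Return the label of the training point closest to `query` (Euclidean distance)."""
    distances = [math.dist(x, query) for x in train_x]
    return train_y[distances.index(min(distances))]

# Hypothetical toy data: feature 1 separates the classes, feature 2 is
# large-scale noise. These values are invented for this illustration.
train_x = [(0.1, 900.0), (0.2, 100.0), (0.9, 850.0), (0.8, 150.0)]
train_y = [0, 0, 1, 1]
query = (0.15, 800.0)  # by feature 1, this point sits with class 0

# Without preprocessing, the large-scale feature dominates the distance.
raw_prediction = predict_1nn(train_x, train_y, query)

# With min-max scaling, both features contribute on a comparable scale.
scaled_train = min_max_scale(train_x, train_x)
scaled_query = min_max_scale([query], train_x)[0]
scaled_prediction = predict_1nn(scaled_train, train_y, scaled_query)

print(raw_prediction, scaled_prediction)
```

On this toy data the same classifier assigns the query to different classes depending on whether the features are rescaled first. Whether a given preprocessing step helps on real data depends on the data set and the classifier, which is the question the paper's criterion addresses.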

Keywords

Data preprocessing; Preprocessing criterion; Fisher; Pseudo Inverse
