Influence of Data Preprocessing

  • Zhu, Changming (College of Information Engineering, Shanghai Maritime University and Department of Computer Science & Engineering, East China University of Science & Technology)
  • Gao, Daqi (Department of Computer Science & Engineering, East China University of Science & Technology)
  • Received : 2015.03.19
  • Accepted : 2016.05.20
  • Published : 2016.06.30

Abstract

In this paper, we study the influence of data preprocessing on classification. We conclude that different preprocessing methods lead to different classification performance. Moreover, not every preprocessing method is necessary, and we give a criterion for determining which preprocessing methods are necessary and which are effective. Experiments on several real-world data sets confirm that different preprocessing methods produce different effects. Further experiments that pair several algorithms with different preprocessing methods also confirm that preprocessing strongly influences the performance of a classifier.
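
The central claim, that the choice of preprocessing changes classifier performance, can be illustrated with a small sketch. It is not the authors' experimental setup or their criterion: the synthetic data, the choice of z-score standardization as the preprocessing step, the 1-nearest-neighbor classifier, and all function names (`make_data`, `one_nn_accuracy`, `standardize`) are assumptions made only for illustration.

```python
import numpy as np


def make_data(n_per_class, rng):
    """Synthetic data (assumed for illustration): the classes differ only in
    feature 0, while feature 1 is uninformative noise on a much larger scale."""
    informative = np.concatenate([
        rng.normal(0.0, 1.0, n_per_class),   # class 0
        rng.normal(6.0, 1.0, n_per_class),   # class 1
    ])
    noise = rng.normal(0.0, 10_000.0, 2 * n_per_class)
    X = np.column_stack([informative, noise])
    y = np.repeat([0, 1], n_per_class)
    return X, y


def one_nn_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy of a 1-nearest-neighbor classifier with Euclidean distance."""
    diffs = X_test[:, None, :] - X_train[None, :, :]
    d2 = (diffs ** 2).sum(axis=2)            # squared distances, test x train
    pred = y_train[d2.argmin(axis=1)]        # label of the closest training point
    return float((pred == y_test).mean())


def standardize(X_train, X_test):
    """Z-score both sets using statistics estimated on the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    return (X_train - mean) / std, (X_test - mean) / std


rng = np.random.default_rng(0)
X_train, y_train = make_data(100, rng)
X_test, y_test = make_data(100, rng)

acc_raw = one_nn_accuracy(X_train, y_train, X_test, y_test)
X_train_s, X_test_s = standardize(X_train, X_test)
acc_scaled = one_nn_accuracy(X_train_s, y_train, X_test_s, y_test)

print(f"1-NN accuracy, raw features:          {acc_raw:.2f}")
print(f"1-NN accuracy, standardized features: {acc_scaled:.2f}")
```

On raw features the large-scale noise feature dominates the distance, so the nearest neighbor is chosen almost independently of the class; after standardization both features contribute on a comparable scale. A distance-based classifier was chosen here because it makes the effect easy to see; other classifiers may be less sensitive to feature scaling.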
