On the Performance of Cuckoo Search and Bat Algorithms Based Instance Selection Techniques for SVM Speed Optimization with Application to e-Fraud Detection

  • Received : 2016.11.11
  • Accepted : 2017.06.11
  • Published : 2018.03.31

Abstract

Support Vector Machine (SVM) is a well-known machine learning classification algorithm that has been widely applied to data mining problems with good accuracy. However, SVM classification speed decreases as dataset size increases. Some applications, such as video surveillance and intrusion detection, require a classifier to be trained very quickly on large datasets. Hence, this paper introduces two filter-based instance selection techniques for optimizing SVM training speed. Fast classification is often achieved at the expense of classification accuracy, and some applications, such as phishing and spam email classifiers, are very sensitive to even a slight drop in classification accuracy. Hence, this paper also introduces two wrapper-based instance selection techniques for improving both SVM predictive accuracy and training speed. The wrapper-based and filter-based techniques are inspired by the Cuckoo Search Algorithm and the Bat Algorithm. The proposed techniques are validated on three popular e-fraud types: credit card fraud, spam email, and phishing email. In addition, the proposed techniques are validated on 20 other datasets provided by the UCI data repository. Statistical analysis is also performed, and the experimental results reveal that the filter-based and wrapper-based techniques significantly improved SVM classification speed. The results further reveal that the wrapper-based techniques improved SVM predictive accuracy in most cases.
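To make the idea of wrapper-based instance selection concrete, the following is a minimal sketch and not the method of this paper. It assumes a binary keep/drop mask over the training instances and a fitness that rewards validation accuracy while penalizing subset size. Two stand-ins are used so that the example runs with the standard library alone: a nearest-centroid classifier takes the place of SVM, and a simple bit-flip hill climb takes the place of Cuckoo Search or the Bat Algorithm. All function names, the `size_penalty` weight, and the synthetic data are hypothetical.

```python
import random


def make_blobs(n, seed):
    """Synthetic two-class data (hypothetical stand-in for a real dataset)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randrange(2)
        center = 0.0 if label == 0 else 3.0
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y


def nearest_centroid_fit(X, y):
    """Stand-in classifier (NOT an SVM): one mean vector per class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(model[c], x)))


def accuracy(model, X, y):
    return sum(predict(model, xi) == yi for xi, yi in zip(X, y)) / len(y)


def fitness(mask, X_tr, y_tr, X_val, y_val, size_penalty=0.1):
    """Wrapper-style fitness: validation accuracy minus a subset-size penalty."""
    idx = [i for i, keep in enumerate(mask) if keep]
    if len({y_tr[i] for i in idx}) < 2:
        return -1.0  # reject subsets that lose one of the two classes
    model = nearest_centroid_fit([X_tr[i] for i in idx], [y_tr[i] for i in idx])
    return accuracy(model, X_val, y_val) - size_penalty * len(idx) / len(mask)


def select_instances(X_tr, y_tr, X_val, y_val, iters=200, seed=0):
    """Bit-flip hill climb over the mask (stand-in for a metaheuristic)."""
    rng = random.Random(seed)
    best = [True] * len(X_tr)
    best_fit = fitness(best, X_tr, y_tr, X_val, y_val)
    for _ in range(iters):
        cand = best[:]
        i = rng.randrange(len(cand))
        cand[i] = not cand[i]  # toggle one instance in or out
        f = fitness(cand, X_tr, y_tr, X_val, y_val)
        if f >= best_fit:  # accept ties so the search can drift
            best, best_fit = cand, f
    return best, best_fit


if __name__ == "__main__":
    X_tr, y_tr = make_blobs(60, seed=1)
    X_val, y_val = make_blobs(40, seed=2)
    mask, fit = select_instances(X_tr, y_tr, X_val, y_val)
    print(f"kept {sum(mask)} of {len(mask)} instances, fitness {fit:.3f}")
```

In this sketch the classifier is retrained for every candidate mask, which is what makes it a wrapper approach. A filter-style variant would instead score a mask from the data alone, without training the classifier inside the search loop; the specific scoring used by the paper is not stated in the abstract.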
