http://dx.doi.org/10.3745/JIPS.04.0149

Feature Selection Using Submodular Approach for Financial Big Data

Attigeri, Girija (Dept. of Information and Communication Technology, Manipal Institute of Technology, Manipal Academy of Higher Education)
Manohara Pai, M.M. (Manipal Institute of Technology, Manipal Academy of Higher Education)
Pai, Radhika M. (Manipal Institute of Technology, Manipal Academy of Higher Education)

Publication Information

Journal of Information Processing Systems / v.15, no.6, 2019, pp. 1306-1325

Abstract

As the world moves toward digitization, data is generated from diverse sources at an ever-faster rate, and the resulting volume is termed big data. The financial sector is one domain that needs to leverage this data to identify financial risks, fraudulent activities, and so on. Designing predictive models for such financial big data is imperative for maintaining the health of a country's economy. Financial data has many features, such as transaction history, repayment data, purchase data, and investment data. A central problem in building a predictive algorithm is finding the right subset of representative features from which a model can be constructed for a particular task. This paper proposes a correlation-based method that uses submodular optimization to select the optimal number of features, thereby reducing the dimensionality of the data for faster and better prediction. The key proposition is that the optimal feature subset should contain features that correlate highly with the class label but do not correlate with each other. Experiments are conducted on loan data to understand the effect of the various subsets on different classification algorithms. The IBM Bluemix BigData platform is used for experimentation, along with the Spark notebook. The results indicate that the proposed approach achieves considerable accuracy with optimal subsets in significantly less execution time. The algorithm is also compared with existing feature selection and feature extraction algorithms.
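
The proposition above (high correlation with the class label, low correlation within the subset) can be illustrated with a minimal sketch. This is not the paper's algorithm, which the abstract does not spell out. It assumes a simple objective, the sum of absolute feature-label Pearson correlations minus a weighted sum of absolute pairwise correlations inside the subset, and a plain greedy loop that stops when no feature adds positive gain. This objective has diminishing marginal gains, which is the submodular property. The names `pearson`, `greedy_select`, `redundancy_weight`, and the toy loan features are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def greedy_select(features, labels, redundancy_weight=1.0, max_features=None):
    """Greedily pick features that are relevant to the label but not redundant.

    features: dict mapping a feature name to a list of numeric values
    labels:   list of numeric class labels
    The objective (relevance minus weighted redundancy) is an assumption
    made for illustration, not the objective used in the paper.
    """
    names = list(features)
    relevance = {f: abs(pearson(features[f], labels)) for f in names}
    budget = len(names) if max_features is None else max_features
    selected = []
    while len(selected) < budget:
        best, best_gain = None, 0.0
        for f in names:
            if f in selected:
                continue
            # Marginal gain shrinks as more correlated features are already chosen.
            redundancy = sum(abs(pearson(features[f], features[s])) for s in selected)
            gain = relevance[f] - redundancy_weight * redundancy
            if gain > best_gain:
                best, best_gain = f, gain
        if best is None:  # no remaining feature improves the objective
            break
        selected.append(best)
    return selected


# Hypothetical toy loan data: the second feature is a rescaled copy of the first.
toy_features = {
    "late_payments": [0, 0, 0, 0, 1, 1, 1, 1],
    "late_payments_scaled": [0, 0, 0, 0, 2, 2, 2, 2],
    "high_utilization": [0, 0, 1, 1, 0, 0, 1, 1],
}
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]  # default only when both risk flags are set

print(greedy_select(toy_features, toy_labels))
# ['late_payments', 'high_utilization']
```

In this toy run the rescaled copy is skipped because it is perfectly correlated with a feature already chosen, while the uncorrelated second feature is kept. Raising or lowering `redundancy_weight` trades subset size against redundancy.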

Keywords

Classification; Correlation; Feature Subset Selection; Financial Big Data; Logistic Regression; Submodular Optimization; Support Vector Machine
