http://dx.doi.org/10.7472/jksii.2019.20.1.01

Feature-selection algorithm based on genetic algorithms using unstructured data for attack mail identification  

Hong, Sung-Sam (Department of Computer Engineering, Gachon University)
Kim, Dong-Wook (Department of Computer Engineering, Gachon University)
Han, Myung-Mook (Department of Computer Engineering, Gachon University)
Publication Information
Journal of Internet Computing and Services / v.20, no.1, 2019, pp. 1-10
Abstract
Because big-data text mining extracts a large number of features, clustering and classification can suffer from high computational complexity and unreliable analysis results. In particular, the term-document matrix obtained through text mining represents term-document features but is sparse. We designed an advanced genetic algorithm (GA) that selects text-mining features for a detection model. Term frequency-inverse document frequency (TF-IDF) is used to reflect document-term relationships during feature extraction, and an iterative process selects a predetermined number of features. We also use a sparsity score to improve the performance of the detection model: when a spam mail data set is highly sparse, the detection model performs poorly and an optimal detection model is difficult to find. By using s(F) as the numerator of the fitness function, the algorithm finds a model that has low sparsity and also a high TF-IDF score. We verified the proposed algorithm by applying it to text classification and found that it achieves higher performance (speed and accuracy) in attack mail classification.
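The abstract does not give the fitness function or the GA operators, so the following Python sketch only illustrates the general idea: a GA searches for a fixed-size feature subset whose columns have a high TF-IDF weight and low sparsity. Everything specific here is an assumption made for illustration and not the authors' method: the function names (`tfidf_matrix`, `sparsity`, `fitness`, `select_features`), the smoothed IDF formula, the fitness form (mean TF-IDF divided by one plus the sparsity), and the tournament selection, union crossover, and swap mutation.

```python
import math
import random


def tfidf_matrix(docs):
    """Return (vocab, matrix); matrix[d][t] is the TF-IDF weight of term t in doc d."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    n_docs = len(tokenized)
    df = [0] * len(vocab)  # document frequency per term
    for doc in tokenized:
        for w in set(doc):
            df[index[w]] += 1
    # Smoothed IDF (an assumption; the paper's exact weighting is not stated).
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    matrix = []
    for doc in tokenized:
        tf = [0.0] * len(vocab)
        for w in doc:
            tf[index[w]] += 1.0
        matrix.append([tf[j] * idf[j] for j in range(len(vocab))])
    return vocab, matrix


def sparsity(matrix, features):
    """Fraction of zero cells in the sub-matrix restricted to the chosen columns."""
    cells = len(matrix) * len(features)
    zeros = sum(1 for row in matrix for j in features if row[j] == 0.0)
    return zeros / cells


def fitness(matrix, features):
    """ASSUMED fitness: mean TF-IDF of the chosen columns, penalized by sparsity."""
    total = sum(row[j] for row in matrix for j in features)
    mean_tfidf = total / (len(matrix) * len(features))
    return mean_tfidf / (1.0 + sparsity(matrix, features))


def select_features(matrix, k, pop_size=20, generations=30,
                    mutation_rate=0.2, seed=0):
    """Hypothetical GA: evolve sorted tuples of k distinct column indices."""
    rng = random.Random(seed)
    n_features = len(matrix[0])

    def score(individual):
        return fitness(matrix, individual)

    population = [tuple(sorted(rng.sample(range(n_features), k)))
                  for _ in range(pop_size)]
    best = max(population, key=score)
    for _ in range(generations):
        next_pop = [best]  # elitism: the best individual always survives
        while len(next_pop) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(population, 2), key=score)
            p2 = max(rng.sample(population, 2), key=score)
            # Crossover: draw k features from the union of both parents.
            pool = sorted(set(p1) | set(p2))
            child = set(rng.sample(pool, k))
            # Mutation: swap one chosen feature for one not currently chosen.
            if rng.random() < mutation_rate and k < n_features:
                child.remove(rng.choice(sorted(child)))
                unused = [j for j in range(n_features) if j not in child]
                child.add(rng.choice(unused))
            next_pop.append(tuple(sorted(child)))
        population = next_pop
        best = max(population, key=score)
    return best
```

Calling `select_features(matrix, k)` on the matrix returned by `tfidf_matrix` yields a sorted tuple of `k` column indices, which map back to terms through `vocab`. The fixed `k` mirrors the "predetermined number of features" in the abstract; the classifier that would consume the selected columns is out of scope for this sketch.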
Keywords
Security; Unstructured Data; Intelligent Data Analysis; Feature Selection; Attack Mail