http://dx.doi.org/10.13088/jiis.2014.20.1.015

Smarter Classification for Imbalanced Data Set and Its Application to Patent Evaluation  

Kwon, Ohbyung (School of Management, Kyung Hee University)
Lee, Jonathan Sangyun (School of Management, Kyung Hee University)
Publication Information
Journal of Intelligence and Information Systems / v.20, no.1, 2014, pp. 15-34
Abstract
Overall accuracy as a performance measure does not fully account for modular accuracy: the accuracy of classifying 1 (or true) as 1 is not the same as the accuracy of classifying 0 (or false) as 0. A smarter classification algorithm would optimize its classification rules to meet the modular-accuracy goals dictated by the nature of the problem. Such algorithms must therefore be both more general with respect to the nature of problems and free from discretization, which can distort real performance. In this paper, we propose a novel vertical boosting algorithm that improves modular accuracies. Rather than discretizing items, we use simple classifiers, such as a regression model, that accept continuous data types. To improve generalization and to select a classification model well suited to the nature of the problem domain, we developed a smart model selection algorithm. To show the soundness of the proposed method, we performed an experiment with a real-world application: predicting the intellectual properties of e-transaction technology, using a data set of more than 47,000 records.
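The gap between overall accuracy and modular accuracy can be illustrated with a small sketch. It is not the vertical boosting algorithm proposed in the paper. It only computes the accuracy on class 1 and the accuracy on class 0 separately, and shows how a cutoff on a continuous score can change them while overall accuracy stays the same. The function names, the scores, and the thresholds are hypothetical and chosen purely for illustration.

```python
def classify(scores, threshold):
    """Turn continuous scores into 0/1 labels with a simple cutoff (illustrative)."""
    return [1 if s >= threshold else 0 for s in scores]


def modular_accuracies(y_true, y_pred):
    """Return (overall accuracy, accuracy on class 1, accuracy on class 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    overall = (tp + tn) / len(y_true)
    acc_pos = tp / n_pos if n_pos else float("nan")
    acc_neg = tn / n_neg if n_neg else float("nan")
    return overall, acc_pos, acc_neg


# Made-up imbalanced sample: 95 records of class 0, 5 records of class 1.
y_true = [0] * 95 + [1] * 5
# Made-up continuous scores, e.g. the output of some regression model.
scores = [0.10] * 90 + [0.35] * 5 + [0.40] * 5

# A high cutoff labels everything 0; a lower cutoff recovers every class-1 record.
high = modular_accuracies(y_true, classify(scores, 0.5))
low = modular_accuracies(y_true, classify(scores, 0.3))

print("threshold 0.5:", high)
print("threshold 0.3:", low)
```

In this made-up sample both cutoffs give the same overall accuracy, yet the accuracy on class 1 differs completely. That is the kind of difference a single accuracy figure hides on an imbalanced data set.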
Keywords
Data imbalance problem; Ensemble method; Intelligent system; Vertical boosting
Citations & Related Records
Times Cited By KSCI: 3