Smarter Classification for Imbalanced Data Set and Its Application to Patent Evaluation

  • Received : 2014.01.14
  • Accepted : 2014.01.27
  • Published : 2014.03.28

Abstract

Overall accuracy as a performance measure does not fully account for modular accuracy: the accuracy of classifying 1 (or true) as 1 is not the same as that of classifying 0 (or false) as 0. A smarter classification algorithm would optimize its classification rules to meet the modular accuracy goals dictated by the nature of the problem. Accordingly, smarter algorithms must both generalize better across problem types and be free from discretization, which may distort the real performance. In this paper, we therefore propose a novel vertical boosting algorithm that improves modular accuracies. Rather than discretizing items, we use simple classifiers, such as a regression model, that accept continuous data types. To improve generalization and to select a classification model well suited to the problem domain, we developed a smart model selection algorithm. To show the soundness of the proposed method, we performed an experiment on a real-world application: predicting the intellectual properties of e-transaction technology, using a data set of more than 47,000 records.

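Overall accuracy as a performance measure does not sufficiently consider so-called modular accuracy, which treats the case where the answer is 1 (true) separately from the case where it is 0 (false). A smarter classification algorithm is therefore needed, one that optimizes its classification rules to match the modular accuracies according to the characteristics of the problem. Such an algorithm should generalize across problem types and should not be constrained by discretization, which can distort actual performance. The purpose of this paper is thus to propose a new boosting algorithm that improves modular accuracy. To promote generalization and to select a classification model suited to the characteristics of the problem domain, we developed a model selection algorithm that takes smartness into account. To verify the performance of the proposed method, we conducted an experiment that assesses the practical applicability of about 47,000 patents.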
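
The distinction the abstract draws between overall accuracy and modular accuracy can be illustrated with a minimal sketch. The function name `modular_accuracies` and the sample labels are assumptions made for illustration only; this is not the paper's vertical boosting algorithm, just the per-class accuracy computation that the abstract's argument relies on.

```python
def modular_accuracies(y_true, y_pred):
    """Return (overall accuracy, accuracy on class 1, accuracy on class 0).

    Hypothetical helper for illustration; labels are assumed to be 0 or 1.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # 1 classified as 1
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # 0 classified as 0
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    overall = (tp + tn) / len(y_true)
    acc_pos = tp / pos if pos else float("nan")
    acc_neg = tn / neg if neg else float("nan")
    return overall, acc_pos, acc_neg


# Made-up imbalanced sample: 5 positives and 95 negatives.
y_true = [1] * 5 + [0] * 95
# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100

overall, acc_pos, acc_neg = modular_accuracies(y_true, y_pred)
print(overall, acc_pos, acc_neg)  # prints: 0.95 0.0 1.0
```

On this made-up sample the overall accuracy is 0.95 even though no positive case is ever identified, which is the kind of gap between overall and modular accuracy that motivates the proposed approach.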
