http://dx.doi.org/10.3745/KIPSTB.2007.14-B.4.287

Improved Focused Sampling for Class Imbalance Problem  

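Kim, Man-Sun (Chonnam National University)
Yang, Hyung-Jeong (School of Electronics and Computer Engineering, Chonnam National University)
Kim, Soo-Hyung (School of Electronics and Computer Engineering, Chonnam National University)
Cheah, Wooi Ping (Department of Computer Science, Chonnam National University)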
Abstract
Many classification algorithms applied to real-world data suffer from the class imbalance problem. Various methods have been proposed to address it, such as altering the training balance and designing better sampling strategies. However, the previous methods are not satisfactory with respect to the distribution of the input data and the constraint. In this paper, we propose a focused sampling method that improves on previous methods. Solving the problem requires selecting a useful subset of the full training set. To obtain this subset, the proposed method divides the data into regions according to scores computed from the distribution of a SOM over the input data. The scores are sorted in ascending order. They represent the distribution of the input data, which may in turn represent the characteristics of the whole data. A new training dataset is obtained by eliminating the less useful data located in the region between an upper bound and a lower bound. The proposed method gives better, or at least similar, classification accuracy compared with previous approaches. It also offers several benefits: a reduced class imbalance ratio, a smaller training set, and prevention of over-fitting. The proposed method was tested with a kNN classifier. An experimental result on the ecoli data set shows that the method achieves precision up to 2.27 times that of the other methods.
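The sampling procedure summarized in the abstract can be illustrated with a minimal sketch. The abstract does not give the score formula, the bound selection rule, or the SOM configuration, so everything specific here is an assumption made for illustration: the score is taken to be each sample's Euclidean distance to its BMU, the bounds are quantiles of the majority-class scores, only majority-class samples are eliminated, and the function names (`train_som`, `bmu_scores`, `focused_sample`) and parameters are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def train_som(data, grid=(4, 4), epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM with online updates; return unit weights."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # Grid coordinates of each unit, used by the Gaussian neighborhood.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialize unit weights from randomly chosen training samples.
    weights = data[rng.integers(0, len(data), rows * cols)].astype(float)
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(len(data)):
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3     # linearly shrinking neighborhood
            x = data[idx]
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            grid_dist2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights


def bmu_scores(data, weights):
    """ASSUMED score: Euclidean distance from each sample to its BMU."""
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return dists.min(axis=1)


def focused_sample(X, y, majority_label, lower_q=0.25, upper_q=0.75):
    """Drop majority-class samples whose score lies strictly between the bounds.

    ASSUMPTION: the bounds are quantiles of the majority-class scores.
    """
    weights = train_som(X)
    scores = bmu_scores(X, weights)
    majority = y == majority_label
    lower, upper = np.quantile(scores[majority], [lower_q, upper_q])
    drop = majority & (scores > lower) & (scores < upper)
    return X[~drop], y[~drop]
```

Calling `focused_sample(X, y, majority_label=0)` on a feature matrix `X` and label vector `y` returns a reduced training set that keeps every minority sample, which could then be passed to a kNN classifier as in the reported experiment.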
Keywords
Unsupervised Learning; SOM (Self-Organizing Map); BMU (Best Matching Unit); Focused Sampling