http://dx.doi.org/10.7232/JKIIE.2015.41.1.025

A Study on Improving Classification Performance for Manufacturing Process Data with Multicollinearity and Imbalanced Distribution  

Lee, Chae Jin (LG Home Entertainment Company)
Park, Cheong-Sool (School of Industrial Management Engineering, Korea University)
Kim, Jun Seok (School of Industrial Management Engineering, Korea University)
Baek, Jun-Geol (School of Industrial Management Engineering, Korea University)
Publication Information
Journal of Korean Institute of Industrial Engineers / v.41, no.1, 2015, pp. 25-33
Abstract
From the viewpoint of manufacturing applications, data mining is a useful method for extracting meaningful knowledge and information about the states of processes. However, data from manufacturing processes typically exhibit two characteristics: multicollinearity and an imbalanced class distribution. These two characteristics are the main causes of biased classification rules and of the selection of wrong variables as important variables. In this paper, we propose a new data mining procedure to solve these problems. First, to determine candidate variables, we propose a multiple hypothesis test. Second, to obtain unbiased classification rules, we propose a decision tree learning method that assigns a different weight to each category of the quality variable. Experimental results with real PDP (plasma display panel) manufacturing data show that the proposed procedure yields better information than other data mining procedures.
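The two-stage procedure the abstract describes can be sketched in pure Python: the Benjamini-Hochberg step-up rule (reference 3) screens per-variable p-values at a chosen false discovery rate to pick candidate variables, and a class-weighted Gini impurity illustrates the kind of split criterion a weighted decision tree would minimize so that the rare defect class is not ignored. This is a minimal illustrative sketch, not the authors' implementation; all function names, weights, and numbers below are assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up: return indices of variables whose
    null hypotheses are rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # largest rank k with p_(k) <= q * k / m
        if p_values[i] <= q * rank / m:
            k_max = rank
    return sorted(order[:k_max])

def weighted_gini(counts, weights):
    """Gini impurity in which each sample of class c counts weights[c]
    times, so a rare defect class can dominate the split criterion."""
    total = sum(weights[c] * n for c, n in counts.items())
    if total == 0:
        return 0.0
    return 1.0 - sum((weights[c] * n / total) ** 2
                     for c, n in counts.items())

# Screen five hypothetical process variables by their p-values.
candidates = benjamini_hochberg([0.001, 0.2, 0.012, 0.04, 0.9], q=0.05)
print(candidates)  # indices of the selected candidate variables

# A 98:2 good/defect node looks nearly pure with equal weights,
# but up-weighting the defect class restores maximal impurity.
print(weighted_gini({'good': 98, 'defect': 2}, {'good': 1, 'defect': 1}))
print(weighted_gini({'good': 98, 'defect': 2}, {'good': 1, 'defect': 49}))
```

Weighting the minority class (here by the inverse class ratio, 49:1) makes a 98:2 node maximally impure, which is exactly the rebalancing effect a weighted decision tree exploits on imbalanced quality data.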
Keywords
Multicollinearity; Imbalanced Data; Multiple Hypothesis Testing; Weighted Decision Tree; Plasma Display Panel;
Citations & Related Records
Times Cited By KSCI : 5
1 Allison, P., Altman, M., Gill, J., and McDonald, M. P. (2004), Convergence problems in logistic regression, Numerical issues in statistical computing for the social scientist, 238-252.
2 Banks, D. L. and Giovanni, P. (1991), Preanalysis of Superlarge Industrial Datasets, ISDS, Duke University, USA.
3 Benjamini, Y. and Hochberg, Y. (1995), Controlling the false discovery rate : A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society : Series B(Methodological), 57, 289-300.
4 Boeuf, J. P. (2003), Plasma display panels : physics, recent developments and key issues, Journal of physics D : Applied physics, 36(6), R53.   DOI
5 Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984), Classification and Regression Trees, Wadsworth, California, USA.
6 Byeon, S. K., Kang, C. W., and Sim, S. B. (2004), Defect Type Prediction Method in Manufacturing Process Using Data Mining Technique, Journal of Industrial and Systems Engineering, 27(2), 10-16.
7 Cunningham, S. P., Spanos, C. J., and Voros, K. (1995), Semiconductor yield improvement : results and best practices, IEEE Transactions on Semiconductor Manufacturing, 8(2), 103-109.   DOI
8 Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003), Multiple hypothesis testing in microarray experiments, Statistical Science, 18(1), 71-103.   DOI
9 Farcomeni, A. (2008), A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion, Statistical Methods in Medical Research, 17(4), 347-388.   DOI
10 Fernandez, G. (2010), Statistical Data Mining Using SAS Applications, 2nd edition, CRC Press, New York, USA.
11 Gibbons, J. D. (1993), Nonparametric statistics : An introduction Vol. 90, Sage, California, USA.
12 Hall, M. A. (1999), Correlation-based feature selection for machine learning, Ph.D. Thesis, The University of Waikato.
13 Hochberg, Y. and Tamhane, A. (1987), Multiple Comparison Procedures, Wiley, New York, USA.
14 Jang, Y. S., Kim, J. W., and Hur, J. (2008), Combined application of data imbalance reduction techniques using genetic algorithm, Journal of Intelligence and Information Systems, 14(3), 133-154.
15 Jang, W. C. (2013), Multiple testing and its applications in high-dimension, Journal of the Korean Data and Information Science Society, 24(5), 1063-1076.   DOI
16 John, G. H., Kohavi, R., and Pfleger, K. (1994), Irrelevant features and the subset selection Problem, ICML, 94, 121-129.
17 Kim, J. H. and Jeong, J. B. (2004), Classification of class-imbalanced data : Effect of over-sampling and under-sampling of training data, The Korean Journal of Applied Statistics, 17(3), 445-457.   DOI
18 Kubat, M., Holte, R., and Matwin, S. (1997), Learning when negative examples abound, Proceedings of the 9th European Conference on Machine Learning, ECML-97, 146-153.
19 Koksal, G., Batmaz, I., and Testik, M. C. (2011), A review of data mining applications for quality improvement in manufacturing industry, Expert Systems with Applications, 38(10), 13448-13467.   DOI
20 Lemon, S. C., Roy, J., Clark, M. A., Friedmann, P. D., and Rakowski, W. (2003), Classification and regression tree analysis in public health : methodological review and comparison with logistic regression, Annals of Behavioral Medicine, 26(3), 172-181.   DOI
21 Lin, W. J. and Chen, J. J. (2012), Class-imbalanced classifiers for high-dimensional data, Briefings in bioinformatics, 14(1), 13-26.   DOI
22 Little, R. J. and Rubin, D. B. (2002), Statistical Analysis with Missing Data, 2nd edition, John Wiley and Sons, New York.
23 Park, J. H. and Byun, J. H. (2002), An analysis method of superlarge manufacturing process data using cleaning and graphical analysis, Journal of the Korean Society for Quality Management, 30(2), 72-85.
24 Polo, J. L., Berzal, F., and Cubero, J. C. (2006), Taking class importance into account, In Hybrid Information Technology, ICHIT'06. International Conference on, 1, 1-6.
25 Pyle, D. (1999), Data preparation for data mining, Morgan Kaufmann, San Francisco, USA.
26 Shmueli, G., Patel, N. R., and Bruce, P. C. (2011), Data Mining for Business Intelligence : Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner, 2nd edition, Wiley, New York, USA.
27 Storey, J. D. (2002), A direct approach to false discovery rates, Journal of the Royal Statistical Society : Series B (Statistical Methodology), 64(3), 479-498.
28 Weiss, G. M. and Provost, F. (2001), The effect of class distribution on classifier learning : an empirical study, Technical Report ML-TR-44, Department of Computer Science, Rutgers University.
29 Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., and Zeileis, A. (2008), Conditional variable importance for random forests, BMC bioinformatics, 9(1), 307.   DOI
30 Van Hulse, J., Khoshgoftaar, T. M., and Napolitano, A. (2007), Experimental perspectives on learning from imbalanced data, In Proceedings of the 24th international conference on Machine learning, 935-942.
31 Zeng, H. and Cheun, T. (2008), Feature selection for clustering high dimensional data, Lecture Notes in Artificial Intelligence, 5351, 913-922.