http://dx.doi.org/10.7232/JKIIE.2015.41.1.025

A Study on Improving Classification Performance for Manufacturing Process Data with Multicollinearity and Imbalanced Distribution  

Lee, Chae Jin (LG Home Entertainment Company)
Park, Cheong-Sool (School of Industrial Management Engineering, Korea University)
Kim, Jun Seok (School of Industrial Management Engineering, Korea University)
Baek, Jun-Geol (School of Industrial Management Engineering, Korea University)
Publication Information
Journal of Korean Institute of Industrial Engineers / v.41, no.1, 2015, pp. 25-33
Abstract
From the viewpoint of manufacturing applications, data mining is a useful method for extracting meaningful knowledge and information about the states of processes. However, data from manufacturing processes typically exhibit two characteristics: multicollinearity and an imbalanced class distribution. These two characteristics are the main causes of biased classification rules and of the selection of wrong variables as important variables. In this paper, we propose a new data mining procedure to solve these problems. First, to determine candidate variables, we propose a multiple hypothesis test. Second, to obtain unbiased classification rules, we propose a decision tree learning method that assigns a different weight to each category of the quality variable. Experimental results with real PDP (plasma display panel) manufacturing data show that the proposed procedure yields better information than other data mining procedures.
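The two-stage procedure the abstract describes can be sketched in pure Python: the Benjamini-Hochberg step-up rule (reference 3) screens per-variable p-values at a chosen false discovery rate to pick candidate variables, and a class-weighted Gini impurity illustrates the kind of split criterion a weighted decision tree would minimize so that the rare defect class is not ignored. This is a minimal illustrative sketch, not the authors' implementation; all function names, weights, and numbers below are assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up: return indices of variables whose
    null hypotheses are rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # largest rank k with p_(k) <= q * k / m
        if p_values[i] <= q * rank / m:
            k_max = rank
    return sorted(order[:k_max])

def weighted_gini(counts, weights):
    """Gini impurity in which each sample of class c counts weights[c]
    times, so a rare defect class can dominate the split criterion."""
    total = sum(weights[c] * n for c, n in counts.items())
    if total == 0:
        return 0.0
    return 1.0 - sum((weights[c] * n / total) ** 2
                     for c, n in counts.items())

# Screen five hypothetical process variables by their p-values.
candidates = benjamini_hochberg([0.001, 0.2, 0.012, 0.04, 0.9], q=0.05)
print(candidates)  # indices of the selected candidate variables

# A 98:2 good/defect node looks nearly pure with equal weights,
# but up-weighting the defect class restores maximal impurity.
print(weighted_gini({'good': 98, 'defect': 2}, {'good': 1, 'defect': 1}))
print(weighted_gini({'good': 98, 'defect': 2}, {'good': 1, 'defect': 49}))
```

Weighting the minority class (here by the inverse class ratio, 49:1) makes a 98:2 node maximally impure, which is exactly the rebalancing effect a weighted decision tree exploits on imbalanced quality data.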
Keywords
Multicollinearity; Imbalanced Data; Multiple Hypothesis Testing; Weighted Decision Tree; Plasma Display Panel;
Citations & Related Records
Times Cited By KSCI : 5
1 Allison, P., Altman, M., Gill, J., and McDonald, M. P. (2004), Convergence problems in logistic regression, Numerical issues in statistical computing for the social scientist, 238-252.
2 Banks, D. L. and Giovanni, P. (1991), Preanalysis of Superlarge Industrial Datasets, ISDS, Duke University, USA.
3 Benjamini, Y. and Hochberg, Y. (1995), Controlling the false discovery rate : A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society : Series B(Methodological), 57, 289-300.
4 Boeuf, J. P. (2003), Plasma display panels : physics, recent developments and key issues, Journal of physics D : Applied physics, 36(6), R53.   DOI
5 Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984), Classification and Regression Trees, Wadsworth, California, USA.
6 Byeon, S. K., Kang, C. W., and Sim, S. B. (2004), Defect Type Prediction Method in Manufacturing Process Using Data Mining Technique, Journal of Industrial and Systems Engineering, 27(2), 10-16.
7 Cunningham, S. P., Spanos, C. J., and Voros, K. (1995), Semiconductor yield improvement : results and best practices, IEEE Transactions on Semiconductor Manufacturing, 8(2), 103-109.   DOI
8 Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003), Multiple hypothesis testing in microarray experiments, Statistical Science, 18(1), 71-103.   DOI
9 Farcomeni, A. (2008), A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion, Statistical Methods in Medical Research, 17(4), 347-388.   DOI
10 Fernandez, G. (2010), Statistical Data Mining Using SAS Applications, 2nd edition, CRC Press, New York, USA.
11 Gibbons, J. D. (1993), Nonparametric statistics : An introduction Vol. 90, Sage, California, USA.
12 Hall, M. A. (1999), Correlation-based feature selection for machine learning, Ph.D. Thesis, The University of Waikato.
13 Hochberg, Y. and Tamhane, A. (1987), Multiple Comparison Procedures, Wiley, New York, USA.
14 Jang, Y. S., Kim, J. W., and Hur, J. (2008), Combined application of data imbalance reduction techniques using genetic algorithm, Journal of Intelligence and Information Systems, 14(3), 133-154.
15 Jang, W. C. (2013), Multiple testing and its applications in high-dimension, Journal of the Korean Data and Information Science Society, 24(5), 1063-1076.   DOI
16 John, G. H., Kohavi, R., and Pfleger, K. (1994), Irrelevant features and the subset selection Problem, ICML, 94, 121-129.
17 Kim, J. H. and Jeong, J. B. (2004), Classification of class-imbalanced data : Effect of over-sampling and under-sampling of training data, The Korean Journal of Applied Statistics, 17(3), 445-457.   DOI
18 Kubat, M., Holte, R., and Matwin, S. (1997), Learning when negative examples abound, Proceedings of the 9th European Conference on Machine Learning, ECML-97, 146-153.
19 Koksal, G., Batmaz, I., and Testik, M. C. (2011), A review of data mining applications for quality improvement in manufacturing industry, Expert Systems with Applications, 38(10), 13448-13467.   DOI
20 Lemon, S. C., Roy, J., Clark, M. A., Friedmann, P. D., and Rakowski, W. (2003), Classification and regression tree analysis in public health : methodological review and comparison with logistic regression, Annals of Behavioral Medicine, 26(3), 172-181.   DOI
21 Lin, W. J. and Chen, J. J. (2012), Class-imbalanced classifiers for high-dimensional data, Briefings in bioinformatics, 14(1), 13-26.   DOI
22 Little, R. J. and Rubin, D. B. (2002), Statistical Analysis with Missing Data, 2nd edition, John Wiley and Sons, New York.
23 Park, J. H. and Byun, J. H. (2002), An analysis method of superlarge manufacturing process data using cleaning and graphical analysis, Journal of the Korean Society for Quality Management, 30(2), 72-85.
24 Polo, J. L., Berzal, F., and Cubero, J. C. (2006), Taking class importance into account, In Hybrid Information Technology, ICHIT'06. International Conference on, 1, 1-6.
25 Pyle, D. (1999), Data preparation for data mining, Morgan Kaufmann, San Francisco, USA.
26 Shmueli, G., Patel, N. R., and Bruce, P. C. (2011), Data Mining for Business Intelligence : Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner, 2nd edition, Wiley, New York, USA.
27 Storey, J. D. (2002), A direct approach to false discovery rates, Journal of the Royal Statistical Society : Series B (Statistical Methodology), 64(3), 479-498.
28 Weiss, G. M. and Provost, F. (2001), The effect of class distribution on classifier learning : an empirical study, Technical Report ML-TR-44, Department of Computer Science, Rutgers University.
29 Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T., and Zeileis, A. (2008), Conditional variable importance for random forests, BMC bioinformatics, 9(1), 307.   DOI
30 Van Hulse, J., Khoshgoftaar, T. M., and Napolitano, A. (2007), Experimental perspectives on learning from imbalanced data, In Proceedings of the 24th international conference on Machine learning, 935-942.
31 Zeng, H. and Cheun, T. (2008), Feature selection for clustering high dimensional data, Lecture Notes in Artificial Intelligence, 5351, 913-922.