Comparison of Partial Least Squares and Support Vector Machine for the Flash Point Prediction of Organic Compounds

  • Lee, Chang Jun (Department of Chemical and Biological Engineering, Seoul National University) ;
  • Ko, Jae Wook (Department of Chemical Engineering, Kwangwoon University) ;
  • Lee, Gibaek (Department of Chemical and Biological Engineering, Chungju National University)
  • Received : 2010.07.28
  • Accepted : 2010.09.01
  • Published : 2010.12.31

Abstract

The flash point is one of the most important physical properties used to assess the fire and explosion hazards of flammable liquids. Although experimental flash point data are needed for the design and construction of chemical plants, there is often a significant gap between the demand for such data and their availability. This study built and compared two models, partial least squares (PLS) and support vector machine (SVM), to predict the experimental flash points of 893 organic compounds taken from DIPPR 801. The independent variables of the models were 65 functional groups, chosen according to the group contribution method, which rests on the assumption that each fragment of a molecule contributes a certain amount to the value of a physical property; the logarithm of the molecular weight was added as a further variable. The prediction errors calculated from cross-validation were used to determine the optimal parameters of the two models. Because the SVM model has three parameters to determine, an optimization technique was needed, and this work adopted particle swarm optimization, a heuristic optimization method. Because the selection of training data can affect prediction performance, 100 randomly selected data sets were generated and tested. The average absolute errors over the whole data set ranged from 13.86 K to 14.55 K for PLS and from 7.44 K to 10.26 K for SVM, indicating that the predictive ability of the SVM is far superior to that of PLS.
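The parameter search described in the abstract can be illustrated with a minimal sketch of a basic global-best particle swarm optimizer over three parameters. This is not the authors' implementation. The abstract does not name the three SVM parameters; the sketch assumes they are the penalty C, the kernel width gamma, and the tube width epsilon of an RBF-kernel regression SVM, searched on a log10 scale. The objective `placeholder_cv_error` is a hypothetical stand-in for the cross-validation error, and the swarm settings (inertia, acceleration coefficients, swarm size, bounds) are illustrative assumptions.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=60,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Minimize objective(x) inside box bounds with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions inside the box, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Each particle remembers its own best position and value.
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    gbest_pos, gbest_val = best_pos[g][:], best_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull toward the particle's own
                # best, and pull toward the swarm's best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (gbest_pos[d] - pos[i][d]))
                # Move, then clamp the coordinate back into the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < gbest_val:
                    gbest_val, gbest_pos = val, pos[i][:]
    return gbest_pos, gbest_val


def placeholder_cv_error(params):
    """Hypothetical smooth error surface; a real study would train the SVM
    with these parameters and return its cross-validation error instead."""
    log_c, log_gamma, log_eps = params
    return (log_c - 2.0) ** 2 + (log_gamma + 3.0) ** 2 + (log_eps + 1.0) ** 2


# Assumed search box for (log10 C, log10 gamma, log10 epsilon).
bounds = [(-2.0, 6.0), (-8.0, 2.0), (-4.0, 1.0)]
best_params, best_error = pso_minimize(placeholder_cv_error, bounds)
print(best_params, best_error)
```

In an actual study, the placeholder objective would be replaced by a function that trains the SVM on the training folds and returns the cross-validated prediction error, so that the swarm converges on the parameter triple with the lowest such error.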
