Ensemble model through mixed projections useful for big data analytics

  • Hyejoon Park (Department of Applied Statistics, Yonsei University) ;
  • Hyunjoong Kim (Department of Applied Statistics, Yonsei University) ;
  • Yung-Seop Lee (Department of Statistics, Dongguk University)
  • Received : 2024.07.30
  • Accepted : 2024.08.19
  • Published : 2024.10.31

Abstract

In this paper, we propose mixed projection forest (MPF), a new classification ensemble method that can be applied effectively in the field of big data analysis. When training the individual classifiers within the ensemble, MPF constructs oblique hyperplanes from a combined rotation matrix derived from two data projection techniques, principal component analysis (PCA) and canonical linear discriminant analysis (CLDA), thereby improving the accuracy of each classifier. In addition, the diversity of the individual classifiers is increased by generating a variety of rotation matrices through random partitioning of the input variable set. This approach ultimately enhances classification performance and proves highly effective in big data analysis that demands precision. We compared the performance of MPF with existing classification ensemble models on 30 real and simulated datasets. The results indicate that MPF is highly competitive in terms of both classification accuracy and classifier diversity.
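The rotation-based scheme described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: for simplicity every feature subset is projected with full-rank PCA, whereas MPF mixes PCA with CLDA projections, and the number of subsets, tree settings, and voting rule shown here are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def fit_projection_forest(X, y, n_trees=25, n_subsets=3, seed=0):
    """Sketch of a rotation-based ensemble in the spirit of MPF.

    Each tree is trained on the data after a block-diagonal rotation
    assembled from per-subset projections (PCA here; MPF also uses CLDA).
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    forest = []
    for _ in range(n_trees):
        # Randomly partition the feature indices into disjoint subsets,
        # so each classifier gets a different rotation (diversity).
        perm = rng.permutation(p)
        subsets = np.array_split(perm, n_subsets)
        R = np.zeros((p, p))
        for idx in subsets:
            # Full-rank PCA on each subset yields a square loading block.
            pca = PCA(n_components=len(idx)).fit(X[:, idx])
            R[np.ix_(idx, idx)] = pca.components_.T
        # Axis-parallel splits on rotated data act as oblique hyperplanes
        # in the original feature space.
        tree = DecisionTreeClassifier(random_state=0).fit(X @ R, y)
        forest.append((R, tree))
    return forest

def predict_projection_forest(forest, X):
    # Majority vote over the rotated trees.
    votes = np.stack([tree.predict(X @ R) for R, tree in forest])
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)
```

Because each tree splits on rotated coordinates, its decision boundaries are oblique with respect to the original variables, which is the accuracy mechanism the abstract describes; the random feature partitioning supplies the diversity mechanism.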

Keywords

Acknowledgments

Hyunjoong Kim's work was supported by the MSIT (Ministry of Science and ICT), Korea, under the ICAN (ICT Challenge and Advanced Network of HRD) support program (IITP-2023-00259934) supervised by the IITP (Institute of Information & Communications Technology Planning & Evaluation) and by the National Research Foundation of Korea (NRF) grant funded by the Korean government (No. 2016R1D1A1B02011696). Yung-Seop Lee's work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2021R1A2C1007095) and by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2020-0-01789) supervised by the IITP (Institute of Information & Communications Technology Planning & Evaluation).
