Random projection ensemble adaptive nearest neighbor classification

  • Kang, Jongkyeong (Department of Statistics, Korea University) ;
  • Jhun, Myoungshic (Department of Applied Mathematics and Statistics, The State University of New York Korea)
  • Received : 2021.03.30
  • Accepted : 2021.04.25
  • Published : 2021.06.30

Abstract

The k-nearest neighbor classifier, widely used in discriminant analysis, relies on a single fixed number of neighbors and therefore cannot reflect the local characteristics of the data. The adaptive nearest neighbor method was developed to address this limitation by choosing the number of neighbors according to the local structure of the data. For high-dimensional data, it is common to reduce the dimension, for example by random projection, before applying k-nearest neighbor classification. A recently developed ensemble technique carefully combines the classification results obtained from many such random projections and makes the final class assignment by voting. In this paper, we propose a new classification method for high-dimensional data that combines the adaptive nearest neighbor method with the random projection ensemble technique. Simulations and real data analyses confirm that the proposed method achieves higher classification accuracy than previously developed methods.
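
As a rough illustration of the random projection ensemble idea summarized in the abstract, the following Python sketch projects the data onto several random low-dimensional subspaces, runs a k-nearest neighbor classifier in each, and assigns the final class by majority vote. It is a minimal sketch under stated assumptions, not the method proposed in the paper: it uses a fixed k rather than an adaptive number of neighbors, it draws Gaussian projections and keeps all of them rather than selecting them carefully, and the function names, parameter values (`d`, `B`, `k`), and synthetic data are hypothetical choices made only for this example.

```python
import numpy as np


def knn_predict(X_train, y_train, X_test, k):
    """Plain k-NN for binary labels 0/1 (fixed k, assumed odd to avoid ties)."""
    # Pairwise squared Euclidean distances, shape (n_test, n_train).
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    # Majority vote among the k nearest training labels.
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)


def rp_ensemble_predict(X_train, y_train, X_test, d=5, B=25, k=5, seed=0):
    """Majority vote over B k-NN classifiers, each fit on a random projection.

    Assumption: projections are i.i.d. Gaussian and all B are used
    (no selection step). B is assumed odd to avoid tied votes.
    """
    rng = np.random.default_rng(seed)
    p = X_train.shape[1]
    votes = np.zeros(X_test.shape[0])
    for _ in range(B):
        A = rng.standard_normal((p, d)) / np.sqrt(d)  # p x d projection
        votes += knn_predict(X_train @ A, y_train, X_test @ A, k)
    return (votes / B > 0.5).astype(int)


# Synthetic two-class data in p = 50 dimensions (illustrative only).
rng = np.random.default_rng(1)
p, n_train, n_test = 50, 200, 100
y_train = np.repeat([0, 1], n_train // 2)
y_test = np.repeat([0, 1], n_test // 2)
# Class 1 is shifted by 2.0 in every coordinate.
X_train = rng.standard_normal((n_train, p)) + 2.0 * y_train[:, None]
X_test = rng.standard_normal((n_test, p)) + 2.0 * y_test[:, None]

pred = rp_ensemble_predict(X_train, y_train, X_test)
accuracy = (pred == y_test).mean()
print(f"ensemble accuracy: {accuracy:.2f}")
```

Replacing `knn_predict` with a classifier that chooses the number of neighbors locally would move this sketch closer to the combination described in the abstract; the details of that adaptive rule are not given here, so it is left out.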

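Acknowledgement

This work was carried out as a Basic Science Research Program project supported by the National Research Foundation of Korea, funded by the Korean government (Ministry of Science and ICT) in 2020 (NRF-2020R1F1A1A01061746).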
