Detecting outliers in multivariate data and visualization-R scripts

  • Kim, Sung-Soo (Department of Information Statistics, Korea National Open University)
  • Received : 2018.06.28
  • Accepted : 2018.08.01
  • Published : 2018.08.31

Abstract

We provide R scripts that detect outliers in multivariate data and link the detected outliers to visualizations. Outliers are detected with three approaches: 1) robust Mahalanobis distance, 2) a method for high-dimensional data, and 3) a density-based approach. Potential outliers are then visualized, together with the overall structure of the data, using 1) multidimensional scaling (MDS) and a minimal spanning tree (MST) combined with k-means clustering, 2) MDS with fviz_cluster, and 3) principal component analysis (PCA) with fviz_cluster. As a case study, we use Major League Baseball (MLB) pitching data from 2013 and 2014, the seasons in which Ryu, Hyun-jin was actively pitching. The R scripts, an R package, and instructions for running them can be downloaded from "http://www.knou.ac.kr/~sskim/ddpoutlier.html".
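The density-based approach named in the abstract can be illustrated with a small sketch. The abstract does not say which density-based method the scripts use, so this example assumes the local outlier factor (LOF) of Breunig et al. (2000) as a representative choice. It is written in plain Python rather than R purely for illustration and is not the author's code. The function names, the toy data, and the choice k = 3 are all hypothetical.

```python
import math


def k_neighbourhood(points, i, k):
    """Return (k-distance of point i, indices of its k nearest neighbours).

    Ties at the k-distance are kept, so the list may hold more than k indices.
    """
    dists = sorted(
        (math.dist(points[i], points[j]), j)
        for j in range(len(points)) if j != i
    )
    k_dist = dists[k - 1][0]
    return k_dist, [j for d, j in dists if d <= k_dist]


def local_outlier_factor(points, k=3):
    """Compute the LOF score of every point (values near 1 suggest an inlier).

    Assumes len(points) > k and that no location is repeated more than
    k times, otherwise a reachability sum can be zero.
    """
    n = len(points)
    hoods = [k_neighbourhood(points, i, k) for i in range(n)]
    k_dist = [h[0] for h in hoods]
    nbrs = [h[1] for h in hoods]

    # Local reachability density: inverse of the mean reachability distance
    # from point i to its neighbours.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], math.dist(points[i], points[j])) for j in nbrs[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for j in nbrs[i]) / (len(nbrs[i]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical toy data: five clustered points and one isolated point.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (5.0, 5.0)]
scores = local_outlier_factor(toy, k=3)
for point, score in zip(toy, scores):
    print(point, round(score, 2))
```

In this toy data the isolated point (5, 5) receives the largest score, while the clustered points score close to 1. The cutoff above which a score counts as an outlier is a judgment call and depends on the data and on k.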
