Detecting outliers in multivariate data and visualization-R scripts

Kim, Sung-Soo

doi:10.5351/KJAS.2018.31.4.517

The Korean Journal of Applied Statistics (응용통계연구)

Volume 31, Issue 4 / Pages 517-528 / 2018 / 1225-066X (pISSN) / 2383-5818 (eISSN)

The Korean Statistical Society (한국통계학회)

다변량 자료에서 특이점 검출 및 시각화 - R 스크립트

Kim, Sung-Soo (Department of Information Statistics, Korea National Open University)

김성수 (한국방송통신대학교 정보통계학과)

Received : 2018.06.28
Accepted : 2018.08.01
Published : 2018.08.31

Abstract

We provide R scripts that detect outliers in multivariate data and link the detected outliers to visualizations. Outliers are detected using three approaches: 1) robust Mahalanobis distance, 2) a method for high-dimensional data, and 3) a density-based approach. To display the detected potential outliers while conveying the structure of the data, we use three visualization techniques: 1) multidimensional scaling (MDS) and a minimal spanning tree (MST) combined with k-means clustering, 2) MDS with fviz_cluster, and 3) principal component analysis (PCA) with fviz_cluster. As a case study, we use Major League Baseball (MLB) pitching data from 2013 and 2014, the seasons in which Ryu, Hyun-jin was actively pitching. The R scripts, an R package, and instructions for running them can be downloaded from http://www.knou.ac.kr/~sskim/ddpoutlier.html.
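The distance-based approach named in the abstract can be illustrated with a minimal sketch. It is written in Python rather than R, it is not taken from the author's scripts, and it substitutes the classical (non-robust) Mahalanobis distance for the robust version the paper uses. The function names `mahalanobis_sq_2d` and `flag_outliers`, the toy data, and the cutoff level `alpha` are assumptions made for illustration. The sketch is restricted to two variables so that the covariance inverse and the chi-square cutoff can be written in closed form.

```python
import math


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the sample covariance matrix (denominator n - 1).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    # d^2 = (v - m)^T S^{-1} (v - m), with the 2x2 inverse written out.
    return [
        (syy * (x - mx) ** 2
         - 2 * sxy * (x - mx) * (y - my)
         + sxx * (y - my) ** 2) / det
        for x, y in points
    ]


def flag_outliers(points, alpha=0.025):
    """Indices of points whose squared distance exceeds a chi-square cutoff."""
    # With 2 degrees of freedom the chi-square CDF is 1 - exp(-x / 2),
    # so the upper-alpha quantile is -2 * ln(alpha).
    cutoff = -2.0 * math.log(alpha)
    return [i for i, d2 in enumerate(mahalanobis_sq_2d(points)) if d2 > cutoff]


# Hypothetical toy data: a 5 x 4 grid of points plus one distant point.
points = [(i % 5, i // 5) for i in range(20)] + [(20, 20)]
print(flag_outliers(points))
```

Because the classical mean and covariance are themselves pulled toward extreme observations, several outliers can mask one another. That is the usual motivation for replacing them with robust estimates, such as the minimum covariance determinant estimator, before computing the distances.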
