[KSCI] Korea Science Citation Index Service

http://dx.doi.org/10.5351/KJAS.2018.31.4.517

Detecting outliers in multivariate data and visualization-R scripts

Kim, Sung-Soo (Department of Information Statistics, Korea National Open University)

Publication Information

The Korean Journal of Applied Statistics / v.31, no.4, 2018, pp. 517-528

Abstract

We provide R scripts for detecting and visualizing outliers in multivariate data. Outliers are detected using three approaches: 1) robust Mahalanobis distance, 2) a method for high-dimensional data, and 3) a density-based approach. Detected potential outliers are visualized using 1) multidimensional scaling (MDS) and a minimal spanning tree (MST) with k-means clustering, 2) MDS with fviz_cluster, and 3) principal component analysis (PCA) with fviz_cluster. As real data sets, we use MLB pitching data from 2013 and 2014, which include Ryu, Hyun-jin. The R scripts, along with an R package, can be downloaded from "http://www.knou.ac.kr/~sskim/ddpoutlier.html".
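As a rough illustration of the first detection approach, the sketch below flags potential outliers by robust Mahalanobis distance and plots them in MDS coordinates. It is not taken from the published scripts. The simulated data, the use of MASS::cov.mcd for the robust estimate, and the 97.5% chi-square cutoff are all assumptions made for this example.

```r
# Sketch only: robust Mahalanobis distances from an MCD estimate.
library(MASS)  # provides cov.mcd(); a recommended package shipped with R

set.seed(1)
# Hypothetical data: 100 bivariate normal rows plus 5 shifted rows (101-105).
x <- rbind(matrix(rnorm(200), ncol = 2),
           matrix(rnorm(10, mean = 6), ncol = 2))

mcd <- cov.mcd(x)                            # robust location and scatter
d2  <- mahalanobis(x, mcd$center, mcd$cov)   # squared robust distances
cutoff   <- qchisq(0.975, df = ncol(x))      # assumed cutoff, not from the paper
outliers <- which(d2 > cutoff)               # row indices of potential outliers

# Classical MDS on Euclidean distances, used here only to display the flags.
mds <- cmdscale(dist(x), k = 2)
plot(mds, col = ifelse(d2 > cutoff, "red", "black"), pch = 19,
     xlab = "MDS 1", ylab = "MDS 2")
```

Because the MCD estimate is computed from the most concentrated subset of rows, the shifted rows have little influence on the center and scatter, so their distances stay large instead of being masked.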

Keywords

potential outliers; visualization; Mahalanobis distance; multidimensional scaling (MDS); minimal spanning tree (MST); principal component analysis (PCA)

Citations & Related Records

Reference

1	Rousseeuw, P. J., Ruts, I., and Tukey, J. W. (1999). The Bagplot: a bivariate boxplot, The American Statistician, 53, 382-387.
2	Rousseeuw, P. J. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator, Technometrics, 41, 212-223.
3	Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63, 411-423.
4	Wickham, H. (2010). ggplot2: Elegant Graphics for Data Analysis, Journal of Statistical Software, 35, Book Review 1.
5	Prim, R. C. (1957). Shortest connection networks and some generalizations, Bell System Technical Journal, 36, 1389-1401.
6	Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics, 20, 53-65.
7	Filzmoser, P. (2004). A multivariate outlier detection method, from: http://file.statistik.tuwien.ac.at/filz/papers/minsk04.pdf
8	Filzmoser, P., Maronna, R., and Werner, M. (2008). Outlier identification in high dimensions, Computational Statistics & Data Analysis, 52, 1694-1711.
9	Hawkins, D. M. (1980). Identification of Outliers, Chapman & Hall, London.
10	Kassambara, A. (2017). Practical Guide to Cluster Analysis in R: Unsupervised Machine Learning, STHDA.
11	Kim, S., Kwon, S., and Cook, D. (2000). Interactive visualization of hierarchical clusters using MDS and MST, Metrika, 51, 39-51.
12	Kriegel, H.-P., Kröger, P., and Zimek, A. (2010). Outlier detection techniques, The 2010 SIAM International Conference on Data Mining.
13	Mahalanobis, P. C. (1936). On the generalized distance in statistics. In Proceedings of the National Institute of Sciences (Calcutta), India, 2, 49-55.
14	Mojena, R. (1977). Hierarchical grouping methods and stopping rules: an evaluation, The Computer Journal, 20, 359-363.
15	Barnett, V. and Lewis, T. (1994). Outliers in Statistical Data (3rd ed), John Wiley & Sons, Chichester.
16	Mojena, R. and Wishart, D. (1980). Stopping rules for Ward's clustering method. In COMPSTAT 1980 Proceedings, Physica-Verlag, 426-432.
17	Pamula, R., Deka, J. K., and Nandi, S. (2011). An outlier detection method based on clustering, 2011 Second International Conference on Emerging Applications of Information Technology, 253-256.
18	Penny, K. I. and Jolliffe, I. T. (2001). A comparison of multivariate outlier detection methods for clinical laboratory safety data, Journal of the Royal Statistical Society. Series D (The Statistician), 50, 295-308.
19	Breunig, M., Kriegel, H., Ng, R., and Sander, J. (2000). LOF: identifying density-based local outliers. In SIGMOD '00 Proceedings of the 2000 ACM SIGMOD International Conference on Management of data, Texas, 93-104.
20	Butler, R. W., Davies, P. L., and Jhun, M. (1993). Asymptotics for the minimum covariance determinant estimator, The Annals of Statistics, 21, 1385-1400.
21	Charrad, M., Ghazzali, N., Boiteau, V., and Niknafs, A. (2014). NbClust: an R package for determining the relevant number of clusters in a data set, Journal of Statistical Software, 61, 1-36.
22	Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD'96 Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Oregon, 226-231.
23	Jayakumar, D. S. and Thomas, B. J. (2013). A new procedure of clustering based on multivariate outlier detection, Journal of Data Science, 11, 69-84.
