http://dx.doi.org/10.5351/KJAS.2018.31.4.517

Detecting outliers in multivariate data and visualization-R scripts  

Kim, Sung-Soo (Department of Information Statistics, Korea National Open University)
Publication Information
The Korean Journal of Applied Statistics, v.31, no.4, 2018, pp. 517-528
Abstract
We provide R scripts to detect outliers in multivariate data and to visualize them. Outliers are detected using three approaches: 1) robust Mahalanobis distance, 2) a method for high-dimensional data, and 3) a density-based approach. Detected potential outliers are visualized using the following techniques: 1) multidimensional scaling (MDS) and minimal spanning tree (MST) with k-means clustering, 2) MDS with fviz_cluster, and 3) principal component analysis (PCA) with fviz_cluster. As real data sets, we use MLB pitching data from 2013 and 2014, which include Ryu, Hyun-jin. The developed R scripts and an R package can be downloaded at "http://www.knou.ac.kr/~sskim/ddpoutlier.html".
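As a rough illustration of the first approach (robust Mahalanobis distance), the sketch below estimates a robust center and covariance with a raw minimum covariance determinant (MCD) estimate and flags points whose squared distance exceeds a chi-square cutoff. It is not taken from the authors' R scripts: it is written in plain Python, restricted to two variables, and finds the MCD subset by exhaustive search, which is only feasible for very small data sets. The function names, the toy data, the subset size h = floor((n + p + 1) / 2), and the 0.975 quantile are all assumptions made for illustration, and the usual consistency and reweighting corrections are omitted.

```python
import math
from itertools import combinations


def mean_cov(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def det2(cov):
    """Determinant of a symmetric 2x2 covariance matrix."""
    sxx, sxy, syy = cov
    return sxx * syy - sxy * sxy


def raw_mcd(points):
    """Raw MCD by exhaustive search (assumption: tiny n, p = 2).

    Returns the mean and covariance of the h-point subset whose
    covariance matrix has the smallest determinant.
    """
    n, p = len(points), 2
    h = (n + p + 1) // 2  # assumed subset size
    best = min(combinations(points, h), key=lambda s: det2(mean_cov(s)[1]))
    return mean_cov(best)


def squared_distances(points, center, cov):
    """Squared Mahalanobis distance of each point from the center."""
    sxx, sxy, syy = cov
    det = det2(cov)
    result = []
    for x, y in points:
        dx, dy = x - center[0], y - center[1]
        result.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return result


def flag_outliers(points, quantile=0.975):
    """Indices of points beyond the chi-square (2 df) quantile cutoff."""
    center, cov = raw_mcd(points)
    # For 2 degrees of freedom the chi-square quantile has a closed form.
    cutoff = -2.0 * math.log(1.0 - quantile)
    d2 = squared_distances(points, center, cov)
    return [i for i, d in enumerate(d2) if d > cutoff], d2


# Toy data (hypothetical): a 3x3 grid of inliers plus one distant point.
points = [(float(x), float(y)) for x in range(3) for y in range(3)]
points.append((10.0, -10.0))

flagged, d2 = flag_outliers(points)
print("flagged indices:", flagged)
```

Because the raw MCD covariance is computed from only the most concentrated half of the data, it tends to understate the spread, so some borderline inliers may also be flagged; practical implementations correct for this and use a faster subset search.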
Keywords
potential outliers; visualization; Mahalanobis distance; multidimensional scaling (MDS); minimal spanning tree (MST); principal component analysis (PCA)