• Title/Summary/Keyword: Outlier analysis

Variable Selection and Outlier Detection for Automated K-means Clustering

  • Kim, Sung-Soo
    • Communications for Statistical Applications and Methods
    • /
    • v.22 no.1
    • /
    • pp.55-67
    • /
    • 2015
  • An important problem in cluster analysis is selecting the variables that define cluster structure while eliminating the noisy variables that mask it; in addition, outlier detection is a fundamental task in cluster analysis. Here we provide an automated K-means clustering process combined with variable selection and outlier identification. The automated K-means clustering procedure consists of three processes: (i) automatically calculating the number of clusters and the initial cluster centers whenever a new variable is added, (ii) identifying outliers for each cluster depending on the variables used, and (iii) selecting the variables that define cluster structure in a forward manner. To select variables, we applied the VS-KM (variable-selection heuristic for K-means clustering) procedure (Brusco and Cradit, 2001). To identify outliers, we used a hybrid approach combining a clustering-based approach and a distance-based approach. Simulation results indicate that the proposed automated K-means clustering procedure is effective at selecting variables and identifying outliers. The implemented R program can be obtained at http://www.knou.ac.kr/~sskim/SVOKmeans.r.
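The hybrid of a clustering-based and a distance-based approach mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' R implementation: the fixed `centers`, the function names, the sample points, and the boxplot-style cut-off (`q3 + k * IQR` on distances to the cluster center) are all assumptions made for illustration.

```python
import math
import statistics

def assign_to_nearest(points, centers):
    """Clustering step: label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]

def flag_cluster_outliers(points, centers, k=1.5):
    """Distance step: within each cluster, flag points that lie unusually far
    from the center. The cut-off q3 + k * IQR is an assumed rule."""
    labels = assign_to_nearest(points, centers)
    flags = [False] * len(points)
    for c, center in enumerate(centers):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if len(members) < 2:
            continue  # quartiles need at least two distances
        dists = [math.dist(points[i], center) for i in members]
        q1, _, q3 = statistics.quantiles(dists, n=4, method="inclusive")
        cutoff = q3 + k * (q3 - q1)
        for i, d in zip(members, dists):
            flags[i] = d > cutoff
    return flags

# Hypothetical data: two tight clusters plus one stray point at (3, 0).
points = [(0.1, 0), (0, 0.2), (-0.3, 0), (0, -0.4), (0.5, 0), (3, 0),
          (10.1, 10), (10, 10.2), (9.7, 10), (10, 9.6), (10.5, 10)]
centers = [(0, 0), (10, 10)]
flags = flag_cluster_outliers(points, centers)
```

In a full procedure the centers would come from K-means itself and would be recomputed after outliers are set aside; the sketch keeps them fixed so that the distance-based step is easy to follow.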

Simultaneous outlier detection and variable selection via difference-based regression model and stochastic search variable selection

  • Park, Jong Suk;Park, Chun Gun;Lee, Kyeong Eun
    • Communications for Statistical Applications and Methods
    • /
    • v.26 no.2
    • /
    • pp.149-161
    • /
    • 2019
  • In this article, we suggest the following approach to simultaneous variable selection and outlier detection. First, we determine possible outlier candidates using properties of an intercept estimator in a difference-based regression model, and the outlier information is reflected in the multiple regression model by adding mean shift parameters. Second, using stochastic search variable selection, we select the best model from the model that includes the outlier candidates as predictors. Finally, we evaluate our method using simulations and real data analysis, which yield promising results. In addition, we need to develop our method to make robust estimates. We will also extend it to the nonparametric regression model for simultaneous outlier detection and variable selection.

Application of deterministic models for obtaining groundwater level distributions through outlier analysis

  • Dae-Hong Min;Saheed Mayowa Taiwo;Junghee Park;Sewon Kim;Hyung-Koo Yoon
    • Geomechanics and Engineering
    • /
    • v.35 no.5
    • /
    • pp.499-509
    • /
    • 2023
  • The objective of this study is to perform outlier analysis to obtain the distribution of groundwater levels through the best model. The groundwater levels are measured in 10, 25 and 30 piezometers in Seoul, Daejeon and Suncheon in South Korea, respectively. Fifty-eight empirical distribution functions were applied to determine a suitable fit for the measured groundwater levels. The best-fitting models for the measured values are the Generalized Pareto distribution, the Johnson SB distribution and the Normal distribution for Seoul, Daejeon and Suncheon, respectively; the reliability is estimated through the Anderson-Darling method. To choose an appropriate confidence interval, the relationship between the amount of outlier data and the confidence level is demonstrated, and 95% is then selected as a reasonable confidence level. The best model shows a smaller error ratio than the GEV distribution, and the results of the Mahalanobis distance and outlier labelling methods are compared and validated. The median-based outlier labelling and Mahalanobis distance methods showed higher validated error ratios than their mean-based equivalents, suggesting that the methods are sensitive to data structure.

The Effect of Outliers in Regression Analysis (회귀 분석에서 이상치가 미치는 영향)

  • Kim, Kwang-Soo;Bae, Young-Ju;Lee, Jin-Gue
    • Journal of Korean Society for Quality Management
    • /
    • v.24 no.2
    • /
    • pp.158-171
    • /
    • 1996
  • An outlier is an observation that appears to deviate extremely from the other collected data. Treating outliers is therefore very important, because they distort the meaning of the data as a whole and reduce the accuracy and validity of the fitted models. The aim of this paper is to present some ways of handling outliers in given data and to investigate how the analysis results differ before and after outlier rejection. Because a variety of methods have been proposed, we select linear regression analysis and two linear programming techniques and compare their results.

Outlier detection of GPS monitoring data using relational analysis and negative selection algorithm

  • Yi, Ting-Hua;Ye, X.W.;Li, Hong-Nan;Guo, Qing
    • Smart Structures and Systems
    • /
    • v.20 no.2
    • /
    • pp.219-229
    • /
    • 2017
  • Outlier detection is an imperative task for identifying abnormal events before structures suffer sudden failure during their service lives. This paper proposes a two-phase method for outlier detection in Global Positioning System (GPS) monitoring data. First, the occurrence of abnormal data is judged promptly using relational analysis, since the relationship among the data obtained from adjacent locations follows a certain rule. Then, a negative selection algorithm (NSA) is adopted to localize the abnormal data more accurately. To reduce the computation cost of the NSA, an improved scheme that integrates an adjustable radius into the training stage is designed and implemented. Numerical simulations and experimental verifications demonstrate that the proposed method compares favorably with the original method in efficiency and reliability. The method relies only on the monitoring data and requires no engineering expertise on the structural operational characteristics, so it can easily be embedded in a software system for the continuous and reliable monitoring of civil infrastructure.

Outlier Detection and Replacement for Vertical Wind Speed in the Measurement of Actual Evapotranspiration (실제증발산 측정 시 연직 풍속 이상치 탐색 및 대체)

  • Park, Chun Gun;Rim, Chang-Soo;Lim, Kwang-Suop;Chae, Hyo-Sok
    • KSCE Journal of Civil and Environmental Engineering Research
    • /
    • v.34 no.5
    • /
    • pp.1455-1461
    • /
    • 2014
  • In this study, flux data measured in the Deokgokje reservoir watershed near Deokyu mountain in May, June, and July 2011 were analyzed statistically to detect and replace outliers in the vertical wind speed used to measure evapotranspiration by the eddy covariance method. To analyze the outliers of vertical wind speed, the boxplot outlier detection method based on the interquartile range (IQR) was employed, and the detected outliers were deleted or replaced with the mean. The measured evapotranspiration was compared before and after the outlier replacement. The results showed a difference between evapotranspiration before and after outlier replacement, especially on rainy days. Therefore, based on the study results, outliers should be deleted or replaced in the measurement of evapotranspiration.
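The boxplot rule named in this abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's procedure: the conventional fence multiplier `k = 1.5`, the "inclusive" quartile method, the choice to replace with the mean of the non-outlying values, and the sample wind speeds are all assumed here, since the abstract does not specify them.

```python
import statistics

def iqr_outlier_mask(values, k=1.5):
    """Return True for values outside the boxplot fences
    [q1 - k * IQR, q3 + k * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

def replace_outliers_with_mean(values, k=1.5):
    """Replace flagged values with the mean of the unflagged ones
    (an assumed choice of mean)."""
    mask = iqr_outlier_mask(values, k)
    kept = [v for v, is_out in zip(values, mask) if not is_out]
    mean_kept = statistics.fmean(kept)
    return [mean_kept if is_out else v for v, is_out in zip(values, mask)]

# Hypothetical vertical wind speeds (m/s) with one spike.
w = [0.10, 0.20, 0.15, 0.12, 0.18, 5.00]
mask = iqr_outlier_mask(w)
cleaned = replace_outliers_with_mean(w)
```

Deleting instead of replacing amounts to keeping only the values whose mask entry is `False`.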

Structural Health Monitoring Methodology based on Outlier Analysis using Acceleration of Subway Stations (가속도 응답을 이용한 이상치 해석 기반 역사 구조 건전성 평가 기법 개발)

  • Shin, Jeong-Ryol;An, Tae-Ki;Lee, Chang-Gil;Park, Seung-Hee
    • Proceedings of the KSR Conference
    • /
    • 2011.10a
    • /
    • pp.281-286
    • /
    • 2011
  • Station structures, an important class of infrastructure that has been in operation since the 1970s, are especially vulnerable to even medium-level earthquakes, and they can be damaged by long-term internal or external vibrations such as ambient vibrations. Recently, much attention has been paid to real-time monitoring of fatal defects or long-term deterioration in civil infrastructure to ensure its safety and adequate performance throughout its life span. In this study, a structural health monitoring methodology using acceleration responses is proposed to evaluate the health state of station structures and to detect the initial stage of damage. A damage index is developed from the acceleration data and applied to outlier analysis, an unsupervised-learning-based pattern recognition method. A threshold value for the outlier analysis is determined from the confidence level of the probability distribution of the acceleration data. The probability distribution is selected according to the features of the collected data.
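Setting a threshold from a confidence level, as this abstract describes, can be sketched as follows. The sketch assumes a normal distribution fitted to baseline damage-index values, whereas the abstract says the distribution is chosen from the features of the data; the function names, the 99% level, and the sample values are likewise hypothetical.

```python
from statistics import NormalDist, fmean, stdev

def outlier_threshold(baseline, confidence=0.99):
    """Upper threshold: the `confidence` quantile of a normal distribution
    fitted to the baseline (healthy-state) index values."""
    fitted = NormalDist(fmean(baseline), stdev(baseline))
    return fitted.inv_cdf(confidence)

def exceeds_threshold(index_values, threshold):
    """Flag index values above the threshold as outliers (possible damage)."""
    return [v > threshold for v in index_values]

# Hypothetical damage-index values from the healthy state and from new data.
baseline = [1.00, 1.10, 0.90, 1.05, 0.95, 1.00, 1.02, 0.98]
threshold = outlier_threshold(baseline)
alarms = exceeds_threshold([1.05, 1.50], threshold)
```

If the data are clearly non-normal, the same pattern applies with the quantile function of whichever distribution fits them better.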

Probabilistic stratification method for heterogeneous soils using outlier analysis (이질적 지반의 확률적 지층구분을 위한 이상치 분석의 적용)

  • Kim, Jeong-Yul;Kim, Hyun-Ki;Cho, Nam-Jun
    • Proceedings of the Korean Geotechnical Society Conference
    • /
    • 2010.03a
    • /
    • pp.1229-1233
    • /
    • 2010
  • Subsurface investigation results obtained from strongly heterogeneous soils sometimes show a high level of uncertainty about stratigraphy and mechanical characteristics. In these cases, important engineering judgments depend on the personal experience of engineers. However, many such problems can be solved by applying appropriate statistical approaches. This study introduces outlier analysis as one of the statistical solutions and presents a simple case study as an example.

Outlier detection in time series data (시계열 자료에서의 특이치 발견)

  • Choi, Jeong In;Um, In Ok;Choa, Hyung Jun
    • The Korean Journal of Applied Statistics
    • /
    • v.29 no.5
    • /
    • pp.907-920
    • /
    • 2016
  • This study suggests an outlier detection algorithm for time series data that uses a quantile autoregressive model, compares its performance with existing methods, and applies it to actual stock manipulation cases. Studies on outlier detection have traditionally been conducted mostly on general data, and studies on time series data are insufficient. They have also been limited to parametric models, which are inconvenient because they are complicated and time-consuming to analyze. We therefore suggest a new outlier detection algorithm for time series data and compare it with existing algorithms through various simulations. In particular, outlier detection in time series data can be useful for finding stock manipulation. If a stock price that had followed a certain pattern departs from it and generates an outlier, this may be due to intentional intervention and manipulation. We examined how quickly the model can detect stock manipulation by applying it to actual stock manipulation cases.

Outlier Detection of the Coastal Water Temperature Monitoring Data Using the Approximate and Detail Components (어림과 나머지 성분을 이용한 연안 수온자료의 이상자료 감지)

  • Cho, Hong-Yeon;Oh, Ji-Hee
    • Journal of the Korean Society for Marine Environment & Energy
    • /
    • v.15 no.2
    • /
    • pp.156-162
    • /
    • 2012
  • An outlier detection and treatment process is essential as the first step in the statistical analysis of monitoring data, because outliers occur frequently in coastal environmental monitoring projects. In this study, an outlier detection method using the approximate and detail (or residual) components of the raw data is suggested. The approximate and detail components of the data can be separated by various filtering and smoothing methods. Here the decomposition is carried out by harmonic analysis and by a local regression curve. Then Grubbs' test and the modified z-score method, both widely used to detect outliers, are applied to the detail components of the water temperature data. A new data set is reconstructed after removing the outliers detected by these methods. The suggested process is shown to apply successfully to outlier detection in the coastal water temperature monitoring data provided by the Real-time Information System for Aquaculture Environment, National Fisheries Research and Development Institute (NFRDI).
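The idea of testing the detail component rather than the raw series can be sketched in Python as below. Two things are swapped in for simplicity and are assumptions, not the paper's method: a centered moving median stands in for the harmonic analysis or local regression smoother, and only the modified z-score (with the commonly used cut-off of 3.5) is shown, not Grubbs' test. The sample temperatures are hypothetical.

```python
import statistics

def moving_median(values, window=5):
    """Approximate component: centered moving median (assumed smoother),
    with the window truncated at both ends of the series."""
    half = window // 2
    return [statistics.median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

def modified_z_scores(values):
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; the modified z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]

def detect_outliers(raw, window=5, cutoff=3.5):
    """Return indices whose detail component (raw minus approximate)
    has |modified z-score| above the cut-off."""
    approx = moving_median(raw, window)
    detail = [r - a for r, a in zip(raw, approx)]
    scores = modified_z_scores(detail)
    return [i for i, z in enumerate(scores) if abs(z) > cutoff]

# Hypothetical water temperatures (deg C) with a slow trend and one spike.
temps = [15.0, 15.2, 15.1, 15.3, 15.4, 25.0, 15.5, 15.7, 15.6, 15.8, 15.9]
flagged = detect_outliers(temps)
```

Working on the detail component keeps the slow upward trend from inflating the spread estimate, so the spike stands out against the residual noise alone.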