• Title/Abstract/Keywords: Outlier analysis

234 search results (processing time: 0.03 s)

Variable Selection and Outlier Detection for Automated K-means Clustering

  • Kim, Sung-Soo
    • Communications for Statistical Applications and Methods / Vol. 22, No. 1 / pp. 55-67 / 2015
  • An important problem in cluster analysis is selecting the variables that define cluster structure while eliminating noisy variables that mask it; outlier detection is likewise a fundamental task in cluster analysis. Here we provide an automated K-means clustering process combined with variable selection and outlier identification. The automated K-means clustering procedure consists of three processes: (i) automatically calculating the number of clusters and the initial cluster centers whenever a new variable is added, (ii) identifying outliers for each cluster depending on the variables used, and (iii) selecting the variables that define cluster structure in a forward manner. To select variables, we applied the VS-KM (variable-selection heuristic for K-means clustering) procedure (Brusco and Cradit, 2001). To identify outliers, we used a hybrid approach combining a clustering-based approach and a distance-based approach. Simulation results indicate that the proposed automated K-means clustering procedure is effective in selecting variables and identifying outliers. The implemented R program can be obtained at http://www.knou.ac.kr/~sskim/SVOKmeans.r.

Simultaneous outlier detection and variable selection via difference-based regression model and stochastic search variable selection

  • Park, Jong Suk; Park, Chun Gun; Lee, Kyeong Eun
    • Communications for Statistical Applications and Methods / Vol. 26, No. 2 / pp. 149-161 / 2019
  • In this article, we suggest the following approach to simultaneous variable selection and outlier detection. First, we determine possible outlier candidates using properties of an intercept estimator in a difference-based regression model, and the outlier information is reflected in the multiple regression model by adding mean shift parameters. Second, using stochastic search variable selection, we select the best model from the model that includes the outlier candidates as predictors. Finally, we evaluate our method using simulations and real data analysis, which yield promising results. In addition, we need to develop our method further to make robust estimates. We will also extend it to the nonparametric regression model for simultaneous outlier detection and variable selection.
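
A mean shift outlier model of the kind the abstract refers to is often written as follows. The notation is an assumption made here for illustration and is not taken from the article.

```latex
\[
  y_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \gamma_i + \varepsilon_i,
  \qquad i = 1, \dots, n,
\]
% \gamma_i = 0   : observation i is a regular point
% \gamma_i \neq 0: observation i is shifted in mean, i.e. an outlier candidate
```

Under this formulation, choosing which $\gamma_i$ and which components of $\boldsymbol{\beta}$ are nonzero amounts to detecting outliers and selecting variables at the same time.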

Application of deterministic models for obtaining groundwater level distributions through outlier analysis

  • Dae-Hong Min; Saheed Mayowa Taiwo; Junghee Park; Sewon Kim; Hyung-Koo Yoon
    • Geomechanics and Engineering / Vol. 35, No. 5 / pp. 499-509 / 2023
  • The objective of this study is to perform outlier analysis to obtain the distribution of groundwater levels through the best model. Groundwater levels were measured in 10, 25 and 30 piezometers in Seoul, Daejeon and Suncheon in South Korea, respectively. Fifty-eight empirical distribution functions were applied to determine a suitable fit for the measured groundwater levels. The best-fitting models for the measured values are the Generalized Pareto distribution, the Johnson SB distribution and the Normal distribution for Seoul, Daejeon and Suncheon, respectively; their reliability is estimated with the Anderson-Darling method. To choose an appropriate confidence interval, the relationship between the amount of outlier data and the confidence level is demonstrated, and 95% is then selected as a reasonable confidence level. The best model shows a smaller error ratio than the generalized extreme value (GEV) distribution, and the results of the Mahalanobis distance and outlier labelling methods are compared and validated. The median-based outlier labelling and Mahalanobis distance methods showed higher validated error ratios than their mean-based equivalents, suggesting that the methods are sensitive to data structure.
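
The relationship between confidence level and the amount of outlier data described above can be illustrated with a minimal sketch. It assumes a Normal fit (the abstract reports this as the best model for Suncheon only), and the function name `outlier_counts` and the sample values are hypothetical, not data from the study.

```python
from statistics import NormalDist


def outlier_counts(levels_m, confidence_levels):
    """Count observations outside the central interval of a fitted normal
    distribution, once for each confidence level."""
    fitted = NormalDist.from_samples(levels_m)  # assumption: data are roughly normal
    counts = {}
    for conf in confidence_levels:
        tail = (1.0 - conf) / 2.0
        lower = fitted.inv_cdf(tail)
        upper = fitted.inv_cdf(1.0 - tail)
        counts[conf] = sum(1 for x in levels_m if x < lower or x > upper)
    return counts


# Hypothetical groundwater levels in metres (not data from the study).
sample_levels = [1.0, 1.1, 0.9, 1.2, 0.8, 1.05, 0.95, 1.0, 1.15, 0.85, 3.0]
counts_by_level = outlier_counts(sample_levels, [0.90, 0.95, 0.99])
```

A wider interval (higher confidence level) can only label the same number of points or fewer as outliers, which is the trade-off weighed when a level such as 95% is chosen.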

The Effect of Outliers in Regression Analysis (회귀 분석에서 이상치가 미치는 영향)

  • 김광수; 배영주; 이진규
    • Journal of the Korean Society for Quality Management (품질경영학회지) / Vol. 24, No. 2 / pp. 158-171 / 1996
  • An outlier is an observation that appears to deviate extremely from the other data collected. Treating outliers is therefore important, because they distort the meaning of the whole data set in analysis and reduce the accuracy and validity of fitted models. The aim of this paper is to present some ways of handling outliers in given data and to investigate how the analysis results differ before and after outlier rejection. Although a variety of methods have been proposed, we select linear regression analysis and two linear programming techniques and compare their results.
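
The before-and-after comparison described above can be sketched for the linear regression case. The rejection rule used here (drop the single point with the largest absolute residual) is an assumption for illustration; the abstract does not state the paper's rule, and the linear programming techniques are not shown. The function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def refit_without_worst(xs, ys):
    """Drop the point with the largest absolute residual, then refit."""
    intercept, slope = fit_line(xs, ys)
    residuals = [abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)]
    worst = residuals.index(max(residuals))
    kept_x = [x for i, x in enumerate(xs) if i != worst]
    kept_y = [y for i, y in enumerate(ys) if i != worst]
    return worst, fit_line(kept_x, kept_y)


# Hypothetical data: y = 2x except for the last point.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 30]
before = fit_line(xs, ys)                          # (intercept, slope) with the outlier
removed_index, after = refit_without_worst(xs, ys)  # fit after rejecting one point
```

In this toy data set the single aberrant point more than doubles the fitted slope, and rejecting it recovers the underlying line.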

Outlier detection of GPS monitoring data using relational analysis and negative selection algorithm

  • Yi, Ting-Hua; Ye, X.W.; Li, Hong-Nan; Guo, Qing
    • Smart Structures and Systems / Vol. 20, No. 2 / pp. 219-229 / 2017
  • Outlier detection is an imperative task for identifying abnormal events before structures suffer sudden failure during their service lives. This paper proposes a two-phase method for outlier detection in Global Positioning System (GPS) monitoring data. Prompt judgment of the occurrence of abnormal data is first carried out using relational analysis, since the relationship among data obtained from adjacent locations follows a certain rule. Then, a negative selection algorithm (NSA) is adopted for more accurate localization of the abnormal data. To reduce the computational cost of the NSA, an improved scheme that integrates an adjustable radius into the training stage is designed and implemented. Numerical simulations and experimental verifications demonstrate that the proposed method compares favorably with the original method in efficiency and reliability. The method relies only on the monitoring data, without requiring engineering expertise on the structural operational characteristics, so it can be easily embedded in a software system for continuous and reliable monitoring of civil infrastructure.

Outlier Detection and Replacement for Vertical Wind Speed in the Measurement of Actual Evapotranspiration (실제증발산 측정 시 연직 풍속 이상치 탐색 및 대체)

  • 박천건; 임창수; 임광섭; 채효석
    • Journal of the Korean Society of Civil Engineers (대한토목학회논문집) / Vol. 34, No. 5 / pp. 1455-1461 / 2014
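  • In this study, flux data observed in May, June, and July 2011 at Deokgokje on Mt. Deogyu (덕유산 덕곡제) were used for a statistical analysis of the identification and replacement of outliers in vertical wind speed that can arise when evapotranspiration is measured with the eddy covariance method. To identify outliers in vertical wind speed, we applied the interquartile range (IQR) criterion from quartile-based boxplot analysis. Evapotranspiration was then measured using vertical wind speed data corrected by deleting the outliers or replacing them with the mean value, and it was compared with the evapotranspiration obtained before correction. The comparison showed a difference between evapotranspiration before and after outlier replacement, with a particularly large difference during rainfall. Outliers arising in the evapotranspiration measurement process therefore need to be deleted or replaced when measuring evapotranspiration.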
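
The boxplot IQR criterion and mean replacement described above can be sketched as follows. The multiplier `k=1.5` is the common boxplot convention, and taking the replacement mean over the retained values is an assumption; the abstract confirms neither. The function names and sample values are hypothetical.

```python
from statistics import fmean, quantiles


def iqr_outlier_mask(values, k=1.5):
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v < lower or v > upper for v in values]


def replace_outliers_with_mean(values, k=1.5):
    """Replace flagged values with the mean of the unflagged values."""
    mask = iqr_outlier_mask(values, k)
    kept = [v for v, bad in zip(values, mask) if not bad]
    fill = fmean(kept)  # assumption: mean is taken over retained values only
    return [fill if bad else v for v, bad in zip(values, mask)]


# Hypothetical vertical wind speeds in m/s (not data from the study).
wind = [0.10, 0.20, 0.15, 0.18, 0.12, 5.00]
mask = iqr_outlier_mask(wind)
cleaned = replace_outliers_with_mean(wind)
```

Deleting the flagged values instead of replacing them corresponds to keeping only the entries where `mask` is `False`.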

Structural Health Monitoring Methodology based on Outlier Analysis using Acceleration of Subway Stations (가속도 응답을 이용한 이상치 해석 기반 역사 구조 건전성 평가 기법 개발)

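  • 신정열; 안태기; 이창길; 박승희
    • Korean Society for Railway: Conference Proceedings (한국철도학회:학술대회논문집) / Proceedings of the Korean Society for Railway 2011 General Meeting and Autumn Conference / pp. 281-286 / 2011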
  • Station structures, an important class of infrastructure that has been in operation since the 1970s, are especially vulnerable even to medium-level earthquakes, and they can be damaged by long-term internal or external vibrations such as ambient vibrations. Recently, much attention has been paid to real-time monitoring of fatal defects and long-term deterioration in civil infrastructure to ensure safety and adequate performance throughout its life span. In this study, a structural health monitoring methodology using acceleration responses is proposed to evaluate the health state of station structures and to detect the initial stage of damage. A damage index is developed from the acceleration data and applied to outlier analysis, an unsupervised-learning-based pattern recognition method. A threshold value for the outlier analysis is determined from the confidence level of the probability distribution of the acceleration data. The probability distribution is selected according to the features of the collected data.

Probabilistic stratification method for heterogeneous soils using outlier analysis (이질적 지반의 확률적 지층구분을 위한 이상치 분석의 적용)

  • 김정열; 김현기; 조남준
    • Korean Geotechnical Society: Conference Proceedings (한국지반공학회:학술대회논문집) / Korean Geotechnical Society 2010 Spring Conference / pp. 1229-1233 / 2010
  • Subsurface investigation results obtained from strongly heterogeneous soils sometimes show a high level of uncertainty about stratigraphy and mechanical characteristics. In these cases, important engineering judgments depend on the personal experience of engineers. However, many such problems can be solved by applying appropriate statistical approaches. This study introduces outlier analysis as one such statistical solution and presents a simple case study as an example.

Outlier detection in time series data (시계열 자료에서의 특이치 발견)

  • 최정인; 엄인옥; 조형준
    • The Korean Journal of Applied Statistics (응용통계연구) / Vol. 29, No. 5 / pp. 907-920 / 2016
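  • The goal of this paper is to propose an algorithm that detects outliers in time series data using a quantile autoregressive model, to compare its performance with existing methods, and to apply it to actual cases of stock price manipulation. Most outlier detection research so far has dealt with general data types, so research on time series data remains scarce. It has also been limited to parametric methods, which are inconvenient because parametric models are complex and require long analysis times. This study therefore presents a new outlier detection algorithm based on the quantile autoregressive model and compares it with existing algorithms through simulations under various scenarios. Outlier detection in time series data can be especially useful for uncovering stock price manipulation: if a stock price observed over time suddenly departs from its previous trend and is detected as an outlier, one may suspect that it was manipulated through artificial intervention. We therefore applied the algorithm to actual stock price manipulation cases to examine how quickly manipulation can be detected.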

Outlier Detection of the Coastal Water Temperature Monitoring Data Using the Approximate and Detail Components (어림과 나머지 성분을 이용한 연안 수온자료의 이상자료 감지)

  • 조홍연; 오지희
    • Journal of the Korean Society for Marine Environment & Energy (한국해양환경ㆍ에너지학회지) / Vol. 15, No. 2 / pp. 156-162 / 2012
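  • As coastal environmental monitoring programs expand, statistical analysis of the vast accumulated monitoring data first requires detecting and handling the outliers that frequently occur in such data. This study proposes an outlier diagnosis technique that uses the approximate component and the detail (or residual) component of coastal environmental monitoring data. The approximate and residual components are extracted using harmonic analysis with periodic functions and local regression function estimation, respectively. The general-purpose Grubbs test and the modified z-score method are then applied to the extracted residual component to diagnose and remove outliers, and the data are reconstructed from the outlier-free data. Applying the proposed technique to the continuous coastal water temperature monitoring data provided by the Real-time Fishing Ground Information System of the National Institute of Fisheries Science (국립수산과학원) showed that outliers were removed successfully.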
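
The modified z-score step applied to the residual component can be sketched as follows. The constant 0.6745 and the cutoff 3.5 follow the commonly cited convention for modified z-scores and are assumptions here, since the abstract does not give the paper's settings. Extraction of the residual component and the Grubbs test are not shown, and the function names and sample residuals are hypothetical.

```python
from statistics import median


def modified_z_scores(residuals):
    """Modified z-scores based on the median and the median absolute deviation."""
    center = median(residuals)
    mad = median(abs(r - center) for r in residuals)
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    return [0.6745 * (r - center) / mad for r in residuals]


def flag_outliers(residuals, threshold=3.5):
    """Indices whose absolute modified z-score exceeds the threshold."""
    scores = modified_z_scores(residuals)
    return [i for i, s in enumerate(scores) if abs(s) > threshold]


# Hypothetical residual component in degrees Celsius (not data from the study).
residuals = [0.1, -0.1, 0.0, 0.2, -0.2, 4.0, 0.1, -0.1]
flagged = flag_outliers(residuals)
```

Because the score is built on the median and the median absolute deviation rather than the mean and standard deviation, a single large residual has little influence on the scale used to judge it.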