• Title/Summary/Keyword: Cluster and Outlier Analysis

Variable Selection and Outlier Detection for Automated K-means Clustering

  • Kim, Sung-Soo
    • Communications for Statistical Applications and Methods / v.22 no.1 / pp.55-67 / 2015
  • An important problem in cluster analysis is selecting the variables that define cluster structure while eliminating noisy variables that mask it; in addition, outlier detection is a fundamental task in cluster analysis. Here we provide an automated K-means clustering process combined with variable selection and outlier identification. The automated K-means clustering procedure consists of three processes: (i) automatically calculating the number of clusters and the initial cluster centers whenever a new variable is added, (ii) identifying outliers for each cluster depending on the variables used, and (iii) selecting the variables that define cluster structure in a forward manner. To select variables, we applied the VS-KM (variable-selection heuristic for K-means clustering) procedure (Brusco and Cradit, 2001). To identify outliers, we used a hybrid approach combining a clustering-based approach and a distance-based approach. Simulation results indicate that the proposed automated K-means clustering procedure is effective at selecting variables and identifying outliers. The implemented R program can be obtained at http://www.knou.ac.kr/~sskim/SVOKmeans.r.
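
The abstract does not spell out how the clustering-based and distance-based steps are combined, so the following is only a minimal sketch of one plausible reading: given cluster labels from any K-means run, flag points whose distance to their own cluster centroid is much larger than the typical distance in that cluster. The function name `flag_cluster_outliers`, the median-based cutoff, and the default `factor=3.0` are assumptions made for illustration and are not taken from the paper or its R program.

```python
import math
from statistics import median


def flag_cluster_outliers(points, labels, factor=3.0):
    """Return indices of points lying unusually far from their cluster centroid.

    Assumption: a point is an outlier when its distance to the centroid
    exceeds `factor` times the median centroid distance in its cluster.
    """
    flagged = []
    for cluster in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cluster]
        dim = len(points[idx[0]])
        # Centroid of the cluster, computed coordinate by coordinate.
        centroid = [sum(points[i][d] for i in idx) / len(idx) for d in range(dim)]
        dists = {i: math.dist(points[i], centroid) for i in idx}
        cutoff = factor * median(dists.values())
        flagged.extend(i for i in idx if dists[i] > cutoff)
    return sorted(flagged)


# Hypothetical data: two tight clusters, with one stray point assigned to the first.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10),
          (20, 20), (20, 21), (21, 20), (21, 21)]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]
outliers = flag_cluster_outliers(points, labels)
```

Because the centroid itself is pulled toward an outlier, a robust cutoff such as the median is used here rather than the mean; a real procedure would typically recompute the centers after removing flagged points.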

Outlier detection of main engine data of a ship using ensemble method (앙상블 기법을 이용한 선박 메인엔진 빅데이터의 이상치 탐지)

  • KIM, Dong-Hyun;LEE, Ji-Hwan;LEE, Sang-Bong;JUNG, Bong-Kyu
    • Journal of the Korean Society of Fisheries and Ocean Technology / v.56 no.4 / pp.384-394 / 2020
  • This paper proposes a machine-learning-based outlier detection model that can diagnose the presence or absence of abnormalities in major engine parts through unsupervised learning analysis of a ship's main engine big data. Engine big data of the ship was collected for more than seven months, and expert knowledge and correlation analysis were used to select features that are closely related to the operation of the main engine. For the unsupervised learning analysis, an ensemble model, in which many predictive models are strategically combined to increase model performance, is used for anomaly detection. As a result, the proposed model successfully distinguished the anomalous engine status from the normal status. To validate our approach, clustering analysis was conducted on the anomalous points to find the different patterns of anomalies. By examining the distribution of each cluster, we could successfully find the patterns of anomalies.

A Performance Comparison of Cluster Validity Indices based on K-means Algorithm (K-means 알고리즘 기반 클러스터링 인덱스 비교 연구)

  • Shim, Yo-Sung;Chung, Ji-Won;Choi, In-Chan
    • Asia pacific journal of information systems / v.16 no.1 / pp.127-144 / 2006
  • The K-means algorithm is widely used at the initial stage of data analysis in the data mining process, partly because of its low time complexity and the simplicity of its practical implementation. Cluster validity indices are used along with the algorithm in order to determine the number of clusters as well as to evaluate the clustering results of datasets. In this paper, we present a performance comparison of sixteen indices, selected from forty indices in the literature, while considering their applicability to nonhierarchical clustering algorithms. The data sets used in the experiment are generated from multivariate normal distributions. In particular, four error types, namely standardization, outlier generation, error perturbation, and noise dimension addition, are considered in the comparison. Through the experiment, the effects of varying the number of points, attributes, and clusters on performance are analyzed. The result of the simulation experiment shows that the Calinski and Harabasz index performs best across all datasets and that the Davies and Bouldin index becomes a strong competitor as the number of points in the dataset increases.
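
As a concrete illustration of one of the indices named above, the Calinski and Harabasz index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom: CH = [B / (k - 1)] / [W / (n - k)]. The sketch below computes it in plain Python; the function name and the toy data are assumptions for illustration, not part of the paper's experiment.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: higher values indicate better-separated clusters."""
    n = len(points)
    clusters = sorted(set(labels))
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")
    dim = len(points[0])
    overall = [sum(p[d] for p in points) / n for d in range(dim)]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    between = 0.0  # B: size-weighted squared distances of centroids to the overall mean
    within = 0.0   # W: squared distances of points to their own centroid
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) * sqdist(centroid, overall)
        within += sum(sqdist(p, centroid) for p in members)
    # Note: W is zero when every point coincides with its centroid,
    # in which case the index is undefined and this division fails.
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical data: the same four points under a good and a poor partition.
pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = calinski_harabasz(pts, [0, 0, 1, 1])
poor = calinski_harabasz(pts, [0, 1, 0, 1])
```

To choose the number of clusters, the index would be evaluated for each candidate K-means solution and the partition with the largest value preferred.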

Travel Behavior Analysis for Short-Term KTX Passenger Demand Forecasting (KTX 단기수요 예측을 위한 통행행태 분석)

  • Kim, Han-Soo;Yun, Dong-Hee;Lee, Sung-Duk
    • Communications for Statistical Applications and Methods / v.19 no.1 / pp.183-192 / 2012
  • This study analyzes travel behavior for a short-term demand forecasting model of the KTX. The research suggests the following. First, twice the standard deviation of the traffic is considered an appropriate outlier criterion. Second, based on a homogeneity test using ANOVA, days were divided into weekdays (Mon-Thu) and weekends (Fri-Sun). Third, a cluster analysis of O/D pairs was performed using trip frequency, traffic averages, and the distance between stations.
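
The two-standard-deviation criterion mentioned above can be sketched as follows. The abstract does not say whether the sample or population standard deviation was used, nor how the traffic series was grouped, so the sample standard deviation, the function name `flag_outliers`, and the daily counts below are all assumptions for illustration.

```python
from statistics import mean, stdev


def flag_outliers(values, k=2.0):
    """Mark values lying more than k sample standard deviations from the mean."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (assumption)
    return [abs(v - m) > k * s for v in values]


# Hypothetical daily passenger counts with one unusually busy day.
traffic = [100, 102, 98, 101, 99, 103, 97, 100, 250]
flags = flag_outliers(traffic)
```

Only the last value is flagged in this example; the ordinary days stay well inside the two-standard-deviation band even though the extreme day inflates both the mean and the standard deviation.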

A Study on the Response Plan by Station Area Cluster through Time Series Analysis of Urban Rail Riders Before and After COVID-19 (COVID-19 전후 도시철도 승차인원 시계열 군집분석을 통한 역세권 군집별 대응방안 고찰)

  • Li, Cheng Xi;Jung, Hun Young
    • KSCE Journal of Civil and Environmental Engineering Research / v.43 no.3 / pp.363-370 / 2023
  • Due to the spread of COVID-19, the use of public transportation such as urban railroads has changed significantly since the beginning of 2020. In this study, daily time series data for each urban railway station were collected for three years before COVID-19 and after the spread of COVID-19. The similarity of the time series was evaluated with the DTW (Dynamic Time Warping) distance method to derive regression centers for each cluster, and the effect of external events such as COVID-19 on changes in the number of users was diagnosed using a time series impact detection function. In addition, the characteristics of use were analyzed by cluster of urban railway stations, and the change in passenger volume due to external shocks was identified. The purpose was to review measures for maintaining and recovering usage in the event of a renewed spread of COVID-19.
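
For readers unfamiliar with DTW, the distance can be computed with a simple dynamic program. The sketch below is the textbook form with an absolute-difference local cost and no warping window; the paper's exact settings are not given in the abstract, so those choices and the name `dtw_distance` are assumptions.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a[i-1] repeats a match with b
                                     cost[i][j - 1],      # b[j-1] repeats a match with a
                                     cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]


# A series and a time-stretched copy of it align perfectly.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
# A constant offset of 1 over three steps costs 3.
shifted = dtw_distance([0, 0, 0], [1, 1, 1])
```

Unlike the Euclidean distance, DTW tolerates series of different lengths and local shifts in timing, which is why it suits comparing ridership curves whose peaks fall on slightly different days.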

Multivariate Stratification Method for the Multipurpose Sample Survey : A Case Study of the Sample Design for Fisher Production Survey (다목적 표본조사를 위한 다변량 층화 : 어업비계통생산량조사를 위한 표본설계 사례)

  • Park, Jin-Woo;Kim, Young-Won;Lee, Seok-Hoon;Shin, Ji-Eun
    • Survey Research / v.9 no.1 / pp.69-85 / 2008
  • Stratification is a feature of the majority of field sample designs. This paper considers a multivariate stratification strategy for multipurpose sample surveys with several auxiliary variables. In a multipurpose survey, the stratification procedure is complicated because the efficiency of stratification must be considered simultaneously for several variables of interest. We propose a stratification strategy based on factor analysis and cluster analysis using several stratification variables. To improve the efficiency of stratification, we first select the stratification variables by factor analysis and then apply the K-means clustering algorithm to form the strata. An application of the stratification strategy to the sampling design for the Fisher Production Survey is discussed, and it turns out that the variances of the estimators are significantly smaller than those obtained by simple random sampling.


Genetic Variation of Pinus densiflora Populations in South Korea Based on ESTP Markers (ESTP 표지를 이용한 국내 소나무 집단의 유전변이)

  • Ahn, Ji Young;Hong, Kyung Nak;Lee, Jei Wan;Hong, Yong Pyo;Kang, Hoduck
    • Korean Journal of Plant Resources / v.28 no.2 / pp.279-289 / 2015
  • Genetic diversity and genetic differentiation of thirteen Pinus densiflora populations in South Korea were estimated using nine ESTP (Expressed Sequence Tag Polymorphism) markers. The number of alleles and the effective number of alleles were 2.2 and 1.8, respectively. The percentage of polymorphic loci (P) was 98.8%. The observed and expected heterozygosities were 0.391 and 0.402, respectively, and eleven populations, all except the Ahngang and Gangneung populations, were in Hardy-Weinberg equilibrium. The level of genetic differentiation (Wright’s FST = 0.057) was higher than that found with isozyme or nSSR markers. Cluster analysis revealed no relationship between genetic distance and the geographic distribution of the populations. Also, the genetic differentiation between populations was not correlated with geographic distance (r = 0.017 and P = 0.344 from the Mantel test). In the FST-outlier analysis to identify loci under selection, six loci were detected at the 99% confidence level by the frequentist method. However, only three loci (sams2+AluⅠ, sams2+RsaⅠ, PtNCS_p14A9+HaeⅢ) were presumed to be outliers by the Bayesian method. The sams2+AluⅠ and sams2+RsaⅠ loci originated from the sams2 gene and seemed to be under balancing selection.
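
The diversity statistics reported above have simple per-locus definitions in terms of allele frequencies. The sketch below uses the standard textbook forms, expected heterozygosity He = 1 - Σp² and effective number of alleles Ae = 1 / Σp²; the abstract does not state which estimators or sample-size corrections the authors applied, so these forms, the function names, and the example frequencies are assumptions.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity at one locus under Hardy-Weinberg proportions."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


def effective_alleles(freqs):
    """Effective number of alleles: the count of equally frequent alleles
    that would give the same expected heterozygosity."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 / sum(p * p for p in freqs)


# Hypothetical biallelic locus with allele frequencies 0.7 and 0.3.
he = expected_heterozygosity([0.7, 0.3])
ae = effective_alleles([0.7, 0.3])
```

Multi-locus summaries such as those in the abstract are usually averages of per-locus values, so they need not satisfy the single-locus identity Ae = 1 / (1 - He) exactly.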

Classification Methods for Automated Prediction of Power Load Patterns (전력 부하 패턴 자동 예측을 위한 분류 기법)

  • Minghao, Piao;Park, Jin-Hyung;Lee, Heon-Gyu;Ryu, Keun-Ho
    • Proceedings of the Korean Information Science Society Conference / 2008.06c / pp.26-30 / 2008
  • This paper presents an automated methodology based on data mining techniques for the prediction of customer load patterns in long-duration load profiles. The proposed approach consists of three stages: (i) data pre-processing: noise and outliers are removed and continuous-valued features are transformed to discrete values; (ii) cluster analysis: k-means clustering is used to create load pattern classes and the representative load profile for each class; and (iii) classification: several supervised learning methods are evaluated in order to select a suitable prediction method. In the proposed methodology, power load measured by the AMR (automatic meter reading) system, as well as customer indexes, were used as inputs for clustering. The output of clustering was the classification of representative load profiles (or classes). To evaluate the forecasting of load patterns, several classification methods were applied to a set of high-voltage customers of the Korea power system, with the class labels derived from clustering and other features used as inputs to build the classifiers. Lastly, the results of our experiments are presented.


Power Load Pattern Classification from AMR Data (AMR 데이터에서의 전력 부하 패턴 분류)

  • Piao, Minghao;Park, Jin-Hyung;Lee, Heon-Gyu;Shin, Jin-Ho;Ryu, Keun-Ho
    • Proceedings of the Korea Information Processing Society Conference / 2008.05a / pp.231-234 / 2008
  • This paper presents an automated methodology based on data mining techniques for the prediction of customer load patterns in load demand data. The main aim of our work is to forecast customers' contract information from their daily power consumption patterns. Based on the result, we try to evaluate the suitability of the contract information. The proposed approach consists of three stages: (i) data preprocessing: noise and outliers are detected and removed; (ii) cluster analysis: SOM clustering is used to create load patterns and the representative load profiles; and (iii) classification: a K-NN classifier is applied in order to predict the customers' contract information based on power consumption patterns. In the proposed methodology, power load measured by the AMR (automatic meter reading) system, as well as customer indexes, were used as inputs. The output was the classification of representative load profiles (or classes). Lastly, in order to evaluate the K-NN classification technique, the proposed methodology was applied to a set of high-voltage customers of the Korea power system, and the results of our experiments are presented.
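
The classification stage described above can be illustrated with a plain k-nearest-neighbour vote. This is only a sketch of the general K-NN technique, assuming Euclidean distance and k = 3; the load profiles, the contract labels "A" and "B", and the function name `knn_predict` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    # Rank training samples by Euclidean distance to the query.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical three-point daily load profiles labelled with a contract type.
profiles = [(1, 1, 1), (1, 2, 1), (2, 1, 1), (8, 9, 8), (9, 9, 9), (9, 8, 8)]
contracts = ["A", "A", "A", "B", "B", "B"]
low_user = knn_predict(profiles, contracts, (1.5, 1, 1))
high_user = knn_predict(profiles, contracts, (8, 8, 9))
```

In the pipeline the abstract describes, the training labels would come from the preceding clustering stage rather than being assigned by hand as they are here.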

A Study on the Spatial Distribution Patterns of Urban Green Spaces Using Local Spatial Autocorrelation Statistics (국지적 공간자기상관통계를 이용한 도시녹지의 공간적 분포패턴에 관한 연구)

  • Kim, Yun-Ki
    • Journal of Cadastre & Land InformatiX / v.50 no.1 / pp.25-45 / 2020
  • The primary purpose of this study is to compare and analyze the performance of local spatial autocorrelation techniques in identifying the spatial distribution patterns of green spaces. To achieve this objective, the study uses satellite image analysis and spatial autocorrelation techniques. The results show that the LISA cluster map with the spatial outlier cluster is superior to the other analytical methods in identifying the spatial distribution pattern of urban green space. This study can contribute to related fields in that it uses several research methods that differ from existing ones. Despite this distinctiveness and usefulness, the study is limited by its use of low-resolution satellite imagery and of NDVI alone among the vegetation indices. These limitations may be overcome in future studies by using UAV images or by using several vegetation indices simultaneously.
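
A LISA cluster map is built from local Moran's I statistics, which compare each location's deviation from the mean with the average deviation of its neighbours. The sketch below assumes row-standardized weights, scaling by the second moment m2 = Σz²/n, and omits the permutation-based significance test that a real LISA map relies on; the function name, the neighbour lists, and the values are hypothetical.

```python
def local_moran(values, neighbors):
    """Return (I_i, quadrant) for each location.

    `neighbors[i]` lists the indices adjacent to location i.
    Quadrants: HH and LL are clusters, HL and LH are spatial outliers.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n  # undefined (zero) if all values are equal
    results = []
    for i in range(n):
        nbrs = neighbors[i]
        # Row-standardized spatial lag: mean deviation of the neighbours.
        lag = sum(z[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        stat = z[i] * lag / m2
        quadrant = ("H" if z[i] >= 0 else "L") + ("H" if lag >= 0 else "L")
        results.append((stat, quadrant))
    return results


# Hypothetical strip of six cells: one low-vegetation cell among high ones.
ndvi = [8, 8, 8, 1, 8, 8]
chain = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
lisa = local_moran(ndvi, chain)
```

The low cell at index 3 falls in the LH quadrant with a negative statistic, which is the kind of spatial outlier the abstract refers to; cells far from it fall in HH.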