• Title/Summary/Keyword: K-mean Clustering

A Computational Intelligence Based Online Data Imputation Method: An Application For Banking

  • Nishanth, Kancherla Jonah;Ravi, Vadlamani
    • Journal of Information Processing Systems
    • /
    • v.9 no.4
    • /
    • pp.633-650
    • /
    • 2013
  • All data imputation techniques proposed in the literature so far are offline techniques: they require a number of iterations to learn the characteristics of the data during training, and they consume a lot of computational time. Hence, these techniques are not suitable for applications that require imputation to be performed on demand and in near real time. The paper proposes a computational intelligence based architecture for online data imputation, as well as extended versions of an existing offline data imputation method. The proposed online imputation technique has two stages. In Stage 1, the Evolving Clustering Method (ECM) is used to replace the missing values with cluster centers, as part of the local learning strategy. Stage 2 refines the resulting approximate values using a General Regression Neural Network (GRNN), as part of the global approximation strategy. The extended offline imputation techniques employ K-Means or K-Medoids in Stage 1 and a Multi Layer Perceptron (MLP) or GRNN in Stage 2. Several experiments were conducted on 8 benchmark datasets and 4 bank-related datasets to assess the effectiveness of the proposed online and offline imputation techniques. In terms of Mean Absolute Percentage Error (MAPE), the results indicate that the difference between the best proposed offline imputation method, K-Medoids+GRNN, and the proposed online imputation method, ECM+GRNN, is statistically insignificant at the 1% level of significance. Consequently, the proposed online technique, being less expensive and faster, can be employed for imputation instead of the existing and proposed offline imputation techniques. This is the significant outcome of the study. Furthermore, GRNN in Stage 2 uniformly reduced MAPE values in both offline and online imputation methods on all datasets.
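The Stage 1 idea above (filling a missing value from a cluster center) and the MAPE metric can be illustrated with a minimal sketch. It is not the authors' implementation: the cluster centers are assumed to be already available from some clustering step, the nearest center is chosen using only the observed features, and the names `impute_with_nearest_center`, `mape`, `centers`, and `record` are hypothetical. The Stage 2 refinement (GRNN or MLP) is not shown.

```python
import math


def impute_with_nearest_center(record, centers):
    """Fill None entries in `record` with values from the nearest center.

    Distance is computed only over the features that are observed.
    """
    observed = [i for i, v in enumerate(record) if v is not None]

    def dist(center):
        return math.sqrt(sum((record[i] - center[i]) ** 2 for i in observed))

    nearest = min(centers, key=dist)
    return [nearest[i] if v is None else v for i, v in enumerate(record)]


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent (actual values must be nonzero)."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical cluster centers and one record with a missing second feature.
centers = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
record = [9.0, None, 31.0]
filled = impute_with_nearest_center(record, centers)
print(filled)
print(mape([100.0, 200.0], [110.0, 180.0]))
```

In this toy case the record is closest to the second center, so the missing value is taken from that center.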

The Algorithm of implementation for genome analysis ecosystems : Mitochondria's case (유전체 생태계 분석을 위한 알고리즘 구현: 미토콘드리아 사례)

  • Choi, Sung-Ja;Cho, Han-Wook
    • Journal of Digital Convergence
    • /
    • v.14 no.4
    • /
    • pp.349-353
    • /
    • 2016
  • The human environment and ecosystem analysis are active areas of research. In recent years, genome analysis services have offered customized disease prevention by reading an individual's genome information. Because of the spread of disease, the use of genetically modified organisms, and the influx of exotic species, genome analysis technology is required to analyze ecosystem genomes accurately and quickly. In this paper, a K-Means clustering algorithm was used for a new classification system. It is expected to provide biologists with new genome information quickly and accurately.

Comparative Study of Quantitative Data Binning Methods in Association Rule

  • Choi, Jae-Ho;Park, Hee-Chang
    • Journal of the Korean Data and Information Science Society
    • /
    • v.19 no.3
    • /
    • pp.903-911
    • /
    • 2008
  • Association rule mining searches for interesting relationships among items in a given large database. Association rules are frequently used by retail stores to assist in marketing, advertising, floor placement, and inventory control. Much of the data involved is quantitative, so techniques for partitioning quantitative data are needed. This partitioning process is referred to as binning. We introduce several binning methods: parameter mean binning, equi-width binning, equi-depth binning, and clustering-based binning. We then apply these binning methods to several distribution types of quantitative data and present the best binning method for association rule discovery.
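Two of the binning methods named above can be sketched in a few lines. This is an illustrative sketch only, using the usual textbook definitions of equi-width and equi-depth binning; the function names and the sample data are hypothetical, and the paper's own procedures may differ in detail.

```python
def equi_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum value would fall into bin k, so clamp it to k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equi_depth_bins(values, k):
    """Assign values to k bins holding (nearly) equal numbers of items."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # skewed sample with one large value
print(equi_width_bins(data, 3))  # [0, 0, 0, 0, 0, 0, 0, 0, 2]
print(equi_depth_bins(data, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

On skewed data the equi-width bins are badly unbalanced while the equi-depth bins are not, which is one reason the choice of method depends on the distribution type. Note that this equi-depth sketch may split tied values across bins.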

Nonlinear damage detection using higher statistical moments of structural responses

  • Yu, Ling;Zhu, Jun-Hua
    • Structural Engineering and Mechanics
    • /
    • v.54 no.2
    • /
    • pp.221-237
    • /
    • 2015
  • This study proposes an integrated method for structural nonlinear damage detection based on time series analysis and the higher statistical moments of structural responses. It combines time series analysis, the higher statistical moments of AR model residual errors, and the fuzzy c-means (FCM) clustering technique. Several comprehensive damage indexes are built from the arithmetic and geometric means of the higher statistical moments and are classified with the FCM clustering method to achieve nonlinear damage detection. A series of measured response data from a three-storey building structure, downloaded from the web site of the Los Alamos National Laboratory (LANL), USA, and covering environmental variability as well as different nonlinear damage cases, is analyzed and used to assess the performance of the new nonlinear damage detection method. Finally, the effectiveness and robustness of the proposed method are analyzed and conclusions are drawn.

An Adaption of Pattern Sequence-based Electricity Load Forecasting with Match Filtering

  • Chu, Fazheng;Jung, Sung-Hwan
    • Journal of Korea Multimedia Society
    • /
    • v.20 no.5
    • /
    • pp.800-807
    • /
    • 2017
  • Pattern Sequence-based Forecasting (PSF) is an approach that forecasts the behavior of a time series based on similar pattern sequences. The innovation of the PSF method is to convert the load time series into a label sequence by means of a clustering technique in order to lighten the computational burden. However, this raises a new problem, namely determining the number of clusters, and the method occasionally suffers from an insufficient number of similar days. In this paper we propose an adaptation of the PSF method that introduces a new clustering index to determine the number of clusters and imposes a threshold to solve the problem caused by insufficient similar days. Our experiments showed that the proposed method reduced the mean absolute percentage error (MAPE) by about 15% compared to the PSF method.
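The label-sequence matching at the core of PSF can be sketched as follows. This is a simplified illustration of the general PSF idea as described in the abstract, not the authors' adaptation: the daily cluster labels are assumed to be given already, and the new clustering index and the threshold are not shown. The function name `psf_forecast` and the sample data are hypothetical.

```python
def psf_forecast(labels, days, window):
    """Average the days that followed earlier occurrences of the latest label pattern.

    labels: one cluster label per past day
    days:   one load profile (list of numbers) per past day
    window: length of the label pattern to match
    """
    pattern = labels[-window:]
    # Indices of days that immediately follow an earlier match of the pattern.
    followers = [
        start + window
        for start in range(len(labels) - window)
        if labels[start:start + window] == pattern
    ]
    if not followers:
        return None  # no similar days were found
    horizon = len(days[0])
    return [sum(days[i][h] for i in followers) / len(followers)
            for h in range(horizon)]


labels = [0, 1, 0, 1, 0, 1]
days = [[1, 2], [5, 6], [10, 20], [5, 7], [12, 22], [5, 6]]
print(psf_forecast(labels, days, window=2))  # [11.0, 21.0]
```

The `None` branch corresponds to the case the abstract describes as insufficient similar days.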

APMDI-CF: An Effective and Efficient Recommendation Algorithm for Online Users

  • Ya-Jun Leng;Zhi Wang;Dan Peng;Huan Zhang
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.17 no.11
    • /
    • pp.3050-3063
    • /
    • 2023
  • Recommendation systems provide personalized products or services to online users by mining their past preferences. Collaborative filtering is a popular recommendation technique because it is easy to implement. However, with the rapid growth in the number of users of recommendation systems, collaborative filtering suffers from serious scalability and sparsity problems. To address these problems, a novel collaborative filtering recommendation algorithm is proposed. The proposed algorithm partitions the users using affinity propagation clustering and searches for the k nearest neighbors within the partition to which the active user belongs, which narrows the search range and improves real-time performance. When predicting ratings for the active user's unrated items, a mean deviation method is used to impute values for the neighbors' missing ratings, so that sparsity is reduced and recommendation quality is maintained. Experiments on two different datasets show that the proposed algorithm performs well in terms of both real-time performance and recommendation quality.

Problems in Fuzzy c-means and Its Possible Solutions (Fuzzy c-means의 문제점 및 해결 방안)

  • Heo, Gyeong-Yong;Seo, Jin-Seok;Lee, Im-Geun
    • Journal of the Korea Society of Computer and Information
    • /
    • v.16 no.1
    • /
    • pp.39-46
    • /
    • 2011
  • Clustering is one of the well-known unsupervised learning methods, in which a data set is grouped into some number of homogeneous clusters. Numerous clustering algorithms are available, and they have been used in various applications. Fuzzy c-means (FCM), the most well-known partitional clustering algorithm, was established in the 1970s and is still in use. However, there are some unsolved problems in FCM, and variants of FCM are still under development. In this paper, the problems in FCM are first explained and the available solutions are then investigated, with the aim of giving researchers some possible directions for future research. Most FCM variants try to solve the problems using domain knowledge specific to a given problem. In this paper, however, we try to give general solutions without using any domain knowledge. Although more remains to be discovered than has been found so far, this paper may be a good starting point for researchers new to the clustering area.
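As background for the discussion of FCM, the standard membership update can be sketched for one-dimensional data. This shows only the textbook update rule with fuzzifier `m`; it does not reproduce the problems or solutions discussed in the paper, and the function name and sample values are hypothetical.

```python
def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership update for 1-D points.

    u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)),
    where d_ij is the distance from point i to center j.
    """
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            # The point coincides with a center: split full membership
            # among the coinciding centers.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            row = [h / sum(hits) for h in hits]
        else:
            row = [1.0 / sum((d / other) ** exponent for other in dists)
                   for d in dists]
        memberships.append(row)
    return memberships


u = fcm_memberships([0.0, 1.0, 5.0], centers=[0.0, 4.0])
for row in u:
    print([round(v, 4) for v in row])
```

Each row sums to 1, so every point, including a distant outlier, distributes a full unit of membership across the clusters.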

Two-stage Sampling for Estimation of Prevalence of Bovine Tuberculosis (이단계표본추출을 이용한 소결핵병 유병률 추정)

  • Pak, Son-Il
    • Journal of Veterinary Clinics
    • /
    • v.28 no.4
    • /
    • pp.422-426
    • /
    • 2011
  • For a national survey targeting a wide geographic region or an entire country, a multi-stage sampling approach is widely used to overcome the problems of simple random sampling, to consider both herd- and animal-level factors associated with disease occurrence, and to adjust for the clustering effect of disease in the population when calculating the sample size. The aim of this study was to establish the sample size for estimating the prevalence of bovine tuberculosis (TB) in Korea using a stratified two-stage sampling design. The sample size was determined by taking into account the possible clustering of TB-infected animals within individual herds, in order to increase the reliability of survey results. In this study, the country was stratified into nine provinces (administrative units), and the herd, the primary sampling unit, was considered as a cluster. For all analyses, a design effect of 2, a between-cluster prevalence of 50% (to yield the maximum sample size), and a mean herd size of 65 were assumed because of the lack of available information. Using a two-stage sampling scheme, the number of cattle sampled per herd was 65, regardless of the confidence level, prevalence, and mean herd size examined. The number of clusters to be sampled at a 95% level of confidence was estimated to be 296, 74, 33, 19, 12, and 9 for a desired precision of 0.01, 0.02, 0.03, 0.04, 0.05, and 0.06, respectively. Therefore, the total sample size with a 95% confidence level was 172,872, 43,218, 19,224, 10,818, 6,930, and 4,806 for desired precision ranging from 0.01 to 0.06. The sample size increased with the desired precision and the design effect. In a situation where the number of cattle sampled per herd is fixed, ranging from 5 to 40 with a 5-head interval, the total sample size with a 95% confidence level was estimated to be 6,480, 10,080, 13,770, 17,280, 20,925, 24,570, 28,350, and 31,680, respectively. The percent increase in total sample size resulting from the use of an intra-cluster correlation coefficient of 0.3 was 22.2, 32.1, 36.3, 39.6, 41.9, 42.9, 42.2, and 44.3%, respectively, in comparison to the use of a coefficient of 0.2.
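The number of clusters reported above can be checked with a short calculation. The abstract does not state the formula used, so this sketch assumes the common design-effect-adjusted sample size for a proportion, n = deff · z² · p(1 − p) / d², divided by the mean herd size and rounded up; `clusters_needed` and its parameter names are hypothetical, and z = 1.96 is assumed for the 95% confidence level.

```python
import math


def clusters_needed(precision, prevalence=0.5, design_effect=2.0,
                    herd_size=65, z=1.96):
    """Herds to sample for a given absolute precision (assumed formula)."""
    # Simple-random-sample size for a proportion, inflated by the design effect.
    n = design_effect * z ** 2 * prevalence * (1 - prevalence) / precision ** 2
    return math.ceil(n / herd_size)


precisions = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
counts = [clusters_needed(d) for d in precisions]
print(counts)  # [296, 74, 33, 19, 12, 9]
```

Under these assumptions the sketch gives 296, 74, 33, 19, 12, and 9 herds, which agrees with the cluster counts in the abstract. The total sample sizes and the intra-cluster correlation comparison are not reproduced here.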

SPOT/VEGETATION-based Algorithm for the Discrimination of Cloud and Snow (SPOT/VEGETATION 영상을 이용한 눈과 구름의 분류 알고리즘)

  • Han Kyung-Soo;Kim Young-Seup
    • Korean Journal of Remote Sensing
    • /
    • v.20 no.4
    • /
    • pp.235-244
    • /
    • 2004
  • This study assesses a proposed algorithm for discriminating cloudy pixels from snowy pixels using visible, near infrared, and short wave infrared channel data from the VEGETATION-1 sensor on board the SPOT-4 satellite. Traditional threshold algorithms for cloud and snow masks have not shown very good accuracy. Instead of these independent masking procedures, a K-Means clustering scheme is employed for cloud/snow discrimination in this study. The pixels used in clustering were selected by integrating two threshold algorithms, which together group the snow and cloud pixels. This offers an opportunity to simplify the clustering procedure and to improve accuracy compared with clustering the full image. The paper also compares the results with threshold methods for snow cover and clouds, and assesses the discrimination capability of the VEGETATION channels. The quality of the cloud and snow masks improved further when the present algorithm was applied. Compared with traditional methods, the discrimination errors were reduced by 19.4% for the cloud mask and 9.7% for the snow mask.
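The clustering step can be illustrated with a plain K-Means (Lloyd's algorithm) sketch on two-channel pixel values. This is not the paper's algorithm: the reflectance values are made up, only two channels are used, the first k points seed the centers, and the threshold-based pixel preselection is not shown. The sample values rely on the general fact that snow is bright in the visible range but dark in the short wave infrared, whereas cloud is bright in both.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical (visible, short wave infrared) reflectances per pixel.
pixels = [(0.80, 0.10), (0.60, 0.55), (0.82, 0.12),
          (0.62, 0.50), (0.78, 0.08), (0.58, 0.52)]
labels, centers = kmeans(pixels, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Here cluster 0 collects the snow-like pixels and cluster 1 the cloud-like pixels. With real imagery the cluster-to-class mapping would still have to be decided after clustering.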

Automatic Source Classification Algorithm using Mean-Shift Clustering and stepwise merging in Color Image (컬러영상에서 Mean-Shift 군집화와 단계별 병합 방법을 이용한 자동 원료 선별 알고리즘)

  • Kim, Sang-Jun;Jang, JiHyeon;Ko, ByoungChul
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2015.10a
    • /
    • pp.1597-1599
    • /
    • 2015
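  • This paper proposes a method that combines the Mean-Shift clustering algorithm with stepwise merging, applied to raw-material images captured by a color CCD camera, in order to detect good and defective items among raw materials such as grains and ores. First, the background is removed from the raw-material training image, and a foreground map of the image is obtained using morphology, based on the degree of color distribution in the image. The Mean-Shift clustering algorithm is then applied to the foreground map to divide the image into N clusters, and similar clusters are merged step by step by comparing positional proximity and the similarity of representative color values. Each merged raw-material object is represented as two-dimensional RG/GB/BR color distributions so that the correlations between image channels are reflected. From the two-dimensional color distribution of each object, the slope of the principal component of the distribution and the corresponding ellipses are generated. The per-object distribution ellipses serve as thresholds for detecting good and defective items in the test raw-material image data. Experiments on various raw-material images showed that the proposed method requires less manual intervention by the user than existing sorting methods and yields accurate sorting results.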