• Title/Summary/Keyword: Clustering sampling

Approximate Clustering on Data Streams Using Discrete Cosine Transform

  • Yu, Feng;Oyana, Damalie;Hou, Wen-Chi;Wainer, Michael
    • Journal of Information Processing Systems / v.6 no.1 / pp.67-78 / 2010
  • In this study, a clustering algorithm that uses data transformed by the discrete cosine transform (DCT) is presented. The algorithm is a grid-based, density-based clustering algorithm that can identify clusters of arbitrary shape. Streaming data are transformed and reconstructed as needed for clustering. Experimental results show that DCT can approximate a data distribution efficiently using only a small number of coefficients while preserving the clusters well. The grid-based clustering algorithm works well with DCT-transformed data, demonstrating the viability of DCT for data stream clustering applications.
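The idea in the abstract above can be illustrated with a minimal sketch under simplifying assumptions: the data are one-dimensional, the grid is a fixed-width histogram, and a cell counts as dense when its reconstructed density reaches a threshold. The function names (`dct2`, `idct2`, `histogram`, `dense_cells`) are illustrative and are not taken from the paper.

```python
import math

def histogram(values, lo, hi, bins):
    """Count values per fixed-width grid cell (assumes lo <= v <= hi)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts

def dct2(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct2(c, n):
    """Inverse of dct2 for length n; coefficients beyond len(c) count as zero."""
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        for k in range(1, len(c)):
            s += c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(s)
    return out

def dense_cells(density, threshold):
    """Indices of grid cells whose (reconstructed) density meets the threshold."""
    return [i for i, d in enumerate(density) if d >= threshold]

# Keep only the first m coefficients as the compact summary of the stream,
# then reconstruct the grid when clustering is needed.
hist = histogram([0.1, 0.15, 0.2, 0.8, 0.85, 0.9, 0.95], 0.0, 1.0, 8)
m = 4  # hypothetical number of retained coefficients
approx = idct2(dct2(hist)[:m], len(hist))
print(dense_cells(approx, threshold=1.0))
```

Keeping the first coefficient preserves the total count exactly, so truncation only smooths how the mass is spread across cells. Adjacent dense cells would then be merged into clusters by the grid-based algorithm.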

Clustering Algorithm by Grid-based Sampling

  • Park, Hee-Chang;Ryu, Jee-Hyun
    • Proceedings of the Korean Data and Information Science Society Conference / 2003.05a / pp.97-108 / 2003
  • Cluster analysis is widely used in applications such as pattern analysis and recognition, data analysis, image processing, and online and offline market research. Clustering can identify dense and sparse regions among data attributes or object attributes. However, obtaining the desired clusters can take many hours, because clustering is a primitive, exploratory technique and is typically applied to large volumes of data. In this paper, we propose a new clustering method that uses grid-based sampling. It is faster than traditional clustering methods while maintaining their accuracy, because the grid-based sample reduces running time. Other clustering applications can also become more effective by combining this method with their original methods.

K-means Clustering using a Center Of Gravity for grid-based sample

  • Park, Hee-Chang;Lee, Sun-Myung
    • Proceedings of the Korean Data and Information Science Society Conference / 2004.04a / pp.51-60 / 2004
  • K-means clustering is an iterative algorithm in which items are moved among sets of clusters until the desired set is reached. It is widely used in applications such as market research, pattern analysis and recognition, and image processing, and it can identify dense and sparse regions among data attributes or object attributes. However, the k-means algorithm can take many hours to obtain the desired k clusters, because it is a primitive, exploratory technique. In this paper, we propose a new k-means clustering method that uses the center of gravity of a grid-based sample. It is faster than traditional clustering methods while maintaining their accuracy.
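One plausible reading of the approach described above can be sketched as follows. The abstract does not give the procedure, so everything here is an assumption: two-dimensional points, square grid cells of a hypothetical `cell_size`, one weighted representative per non-empty cell (its center of gravity), and a standard weighted Lloyd iteration over those representatives. The names `grid_centroids` and `weighted_kmeans` are illustrative.

```python
import random
from collections import defaultdict

def grid_centroids(points, cell_size):
    """Return (center_of_gravity, count) for every non-empty grid cell."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    samples = []
    for members in cells.values():
        n = len(members)
        cx = sum(p[0] for p in members) / n
        cy = sum(p[1] for p in members) / n
        samples.append(((cx, cy), n))
    return samples

def weighted_kmeans(samples, k, iters=50, seed=0):
    """Lloyd's k-means over weighted samples; returns the k cluster centers."""
    rng = random.Random(seed)
    centers = [c for c, _ in rng.sample(samples, k)]
    for _ in range(iters):
        sums = [[0.0, 0.0, 0.0] for _ in range(k)]  # weighted x, weighted y, weight
        for (x, y), w in samples:
            j = min(range(k),
                    key=lambda j: (x - centers[j][0]) ** 2 + (y - centers[j][1]) ** 2)
            sums[j][0] += w * x
            sums[j][1] += w * y
            sums[j][2] += w
        new_centers = [
            (sx / sw, sy / sw) if sw > 0 else centers[j]  # keep empty clusters in place
            for j, (sx, sy, sw) in enumerate(sums)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers

points = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (0.3, 0.4),
          (10.1, 10.2), (10.3, 10.1), (10.2, 10.4), (10.4, 10.3)]
samples = grid_centroids(points, cell_size=1.0)
print(sorted(weighted_kmeans(samples, k=2)))
```

The running-time saving in this sketch comes from iterating over one representative per occupied cell instead of over every point, while the cell counts keep the centers weighted toward dense regions.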

Clustering Algorithm by Grid-based Sampling

  • Park, Hee-Chang;Ryu, Jee-Hyun;Lee, Sung-Yong
    • Journal of the Korean Data and Information Science Society / v.14 no.3 / pp.535-543 / 2003
  • Cluster analysis is widely used in applications such as pattern analysis and recognition, data analysis, image processing, and online and offline market research. Clustering can identify dense and sparse regions among data attributes or object attributes. However, obtaining the desired clusters can take many hours, because clustering is a primitive, exploratory technique and is typically applied to large volumes of data. In this paper, we propose a new clustering method that uses grid-based sampling. It is faster than traditional clustering methods while maintaining their accuracy.

Similarity of Sampling Sites by Water Quality (수질 관측지점 유사성 측정방법 연구)

  • Kwon, Se-Hyug;Lee, Yo-Sang
    • Communications for Statistical Applications and Methods / v.17 no.1 / pp.39-45 / 2010
  • As the value placed on the environment increases, water quality has become a matter of interest to the nation and its people. Water quality has been widely studied, but research has focused on geographical and river characteristics such as inflow, outflow, water volume, and flow speed. In this paper, two approaches to measuring the similarity of sampling sites from water quality data are discussed and compared using two years of empirical data from Yongdam Dam. The existing method calculates similarities from principal component scores. The proposed approach uses the correlation matrix of water-quality variables together with multidimensional scaling (MDS) to measure similarity. It is shown to perform better in the sense that its clustering matches the geographical clustering, since it can take the time-series pattern of water quality into account.

A Data Mining Procedure for Unbalanced Binary Classification (불균형 이분 데이터 분류분석을 위한 데이터마이닝 절차)

  • Jung, Han-Na;Lee, Jeong-Hwa;Jun, Chi-Hyuck
    • Journal of Korean Institute of Industrial Engineers / v.36 no.1 / pp.13-21 / 2010
  • Predicting customer contract cancellation is essential for insurance companies, but it is a difficult problem because the customer database is large and the target (cancelled) customers make up only a small proportion of it. This paper proposes a new data mining approach to binary classification that handles large-scale unbalanced data. The approach incorporates over-sampling, clustering, regularized logistic regression, and boosting. It was applied to a real insurance data set, and the results were compared with those of several other classification techniques.

A method for multiple identical object tracking (동일한 다중 물체 추적 기법)

  • Chun, Gi-Hong;Kang, Hang-Bong
    • Proceedings of the IEEK Conference / 2006.06a / pp.679-680 / 2006
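  • This paper addresses the shortcomings of the particle filter, the most widely known tracking algorithm, by using a sampling method based on motion-vector prediction together with K-means clustering. One problem in tracking is that when multiple similar objects merge and then split, the tracker fails to follow them properly and ends up tracking only one object. Another is that even when the objects can be tracked individually after the split, the previously tracked objects are not labeled correctly. The merge-split problem is addressed with an improved K-means clustering, and the labeling problem with an improved sampling method that uses motion vectors.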

An Optimization Approach to Data Clustering

  • Kim, Ju-Mi;Olafsson, Sigurdur
    • Proceedings of the Korean Operations and Management Science Society Conference / 2005.05a / pp.621-628 / 2005
  • The scalability of clustering algorithms is a critical issue facing the data mining community. This is particularly true for computationally intense tasks such as data clustering. Random sampling of instances is one possible means of achieving scalability, but a pervasive problem with this approach is how to deal with the noise that it introduces into the evaluation of the learning algorithm. This paper develops a new optimization-based clustering approach using an algorithm specifically designed for noisy performance evaluation. Numerical results illustrate that this algorithm can achieve substantial savings in computational time without sacrificing solution quality.

A Study on Partial Pattern Estimation for Sequential Agglomerative Hierarchical Nested Model (SAHN 모델의 부분적 패턴 추정 방법에 대한 연구)

  • Jang, Kyung-Won;Ahn, Tae-Chon
    • Proceedings of the KIEE Conference / 2005.10b / pp.143-145 / 2005
  • In this paper, we present an empirical study of a pattern estimation method intended to reveal underlying data patterns at a relatively low computational cost. The method performs crisp clustering of n given data samples by means of the sequential agglomerative hierarchical nested (SAHN) model. Conventional SAHN-based clustering requires a large computation time in the initial step of the algorithm. To address this, we modify the overall process with a partial approach: the data set is first divided into several subgroups by uniform sampling, and the SAHN-based method is then applied to each subgroup. The advantage of this method is that it reduces the computation time of the original process while giving similar results. The proposed method is applied to several test data sets, and simulation results with a conceptual analysis are presented.

Sampling-Based Automated Parameter Estimation for Canopy Clustering (샘플링 기반 Canopy Clustering 파라미터 설정 기법)

  • Choi, Sung-Woon;Yu, Seung-Hak;Yoon, Sung-Roh
    • Proceedings of the Korean Information Science Society Conference / 2012.06b / pp.438-440 / 2012
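  • Canopy Clustering, developed to cluster large-scale data efficiently, forms canopies based on two parameters (T1, T2), so the clustering result can vary greatly depending on these parameters. Choosing parameter values that reflect the characteristics of the data is therefore very important. However, because no automated parameter-setting technique has been available, previous studies have generally set the Canopy Clustering parameters based on the user's experience. This paper proposes a method that uses statistical sampling to set the values of T1 and T2 effectively.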
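To make the role of T1 and T2 concrete, here is a minimal sketch of a common formulation of canopy clustering, with T1 as the loose threshold and T2 as the tight threshold (T1 > T2). It does not reproduce the paper's sampling-based estimation of these values, which the abstract does not detail. The function name `canopy_clustering` and the choice to compare each center only against the points still on the candidate list are assumptions of this sketch.

```python
import math

def canopy_clustering(points, t1, t2, distance=math.dist):
    """Group points into (possibly overlapping) canopies.

    t1: loose threshold, points closer than t1 to a center join its canopy.
    t2: tight threshold, points closer than t2 stop being candidate centers.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # next candidate becomes a canopy center
        members = [center]
        still_candidates = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)
            if d >= t2:
                still_candidates.append(p)
        remaining = still_candidates
        canopies.append((center, members))
    return canopies

points = [(0, 0), (0.5, 0), (2, 0), (10, 0), (10.5, 0)]
print(len(canopy_clustering(points, t1=3, t2=1)))    # 3 canopies
print(len(canopy_clustering(points, t1=3, t2=2.5)))  # 2 canopies
```

On this toy input, raising T2 from 1 to 2.5 removes the point (2, 0) from the candidate list before it can seed its own canopy, which illustrates the sensitivity to parameter choice that the abstract describes.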