• Title/Summary/Keyword: data clustering


An Overview of Unsupervised and Semi-Supervised Fuzzy Kernel Clustering

  • Frigui, Hichem;Bchir, Ouiem;Baili, Naouel
    • International Journal of Fuzzy Logic and Intelligent Systems / v.13 no.4 / pp.254-268 / 2013
  • In real-world clustering tasks, the input data are typically not easily separable, either because the data structure is highly complex or because clusters vary in size, density, and shape. Kernel-based clustering has proven to be an effective approach to partitioning such data. In this paper, we provide an overview of several fuzzy kernel clustering algorithms. We focus on methods that optimize a fuzzy C-means-type objective function, and we highlight the advantages and disadvantages of each method. In addition to the completely unsupervised algorithms, we also provide an overview of some semi-supervised fuzzy kernel clustering algorithms. These algorithms use partial supervision information to guide the optimization process and avoid local minima. We also provide an overview of the different approaches that have been used to extend kernel clustering to handle very large data sets.
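
For orientation, the standard fuzzy C-means objective that this family of methods builds on can be written as below. The notation is an assumption made here for illustration and is not taken from the abstract: u_ij is the membership of point x_j in cluster i, v_i is a cluster center, m > 1 is the fuzzifier, and K is a kernel with feature map φ. The second expression is the usual kernel-trick expansion of the feature-space distance when centers are membership-weighted means in feature space; individual algorithms in the paper may differ.

```latex
\[
J_m(U, V) = \sum_{i=1}^{C} \sum_{j=1}^{N} u_{ij}^{m} \, \lVert x_j - v_i \rVert^2,
\qquad \text{subject to } \sum_{i=1}^{C} u_{ij} = 1, \; u_{ij} \in [0, 1], \; m > 1.
\]

% Kernel variant: the distance is measured in feature space, with
% v_i^{\phi} = \sum_k u_{ik}^m \phi(x_k) / \sum_k u_{ik}^m.
\[
\lVert \phi(x_j) - v_i^{\phi} \rVert^2
= K(x_j, x_j)
- \frac{2 \sum_{k=1}^{N} u_{ik}^{m} K(x_k, x_j)}{\sum_{k=1}^{N} u_{ik}^{m}}
+ \frac{\sum_{k=1}^{N} \sum_{l=1}^{N} u_{ik}^{m} u_{il}^{m} K(x_k, x_l)}
       {\left( \sum_{k=1}^{N} u_{ik}^{m} \right)^{2}}
\]
```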

More Efficient k-Modes Clustering Algorithm

  • Kim, Dae-Won;Chae, Yi-Geun
    • Journal of the Korean Data and Information Science Society / v.16 no.3 / pp.549-556 / 2005
  • The hard-type centroids used in conventional clustering algorithms such as the k-modes algorithm cannot retain the uncertainty inherent in data sets for as long as possible before the actual clustering decisions are made. Therefore, we propose the k-populations algorithm to extend clustering ability and to heed the data characteristics. In various experiments, the k-populations algorithm was found to give markedly better clustering results.
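
As background, the conventional k-modes baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration of hard modes with the simple matching dissimilarity, not the proposed k-populations algorithm, whose details the abstract does not give. The function names and the toy data are hypothetical.

```python
from collections import Counter


def matching_distance(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Hard centroid: the most frequent category of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, initial_modes, max_iter=20):
    """Alternate between assigning rows to the nearest mode and updating modes."""
    modes = list(initial_modes)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)), key=lambda k: matching_distance(row, modes[k]))
            for row in data
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        for k in range(len(modes)):
            members = [row for row, lab in zip(data, labels) if lab == k]
            if members:  # keep the old mode if a cluster becomes empty
                modes[k] = column_modes(members)
    return labels, modes


# Hypothetical categorical records (colour, size).
data = [
    ("red", "small"),
    ("red", "medium"),
    ("blue", "large"),
    ("blue", "large"),
    ("red", "small"),
]
labels, modes = k_modes(data, initial_modes=[data[0], data[2]])
```

Each mode keeps only the single most frequent category per attribute, so the frequencies of the other categories are discarded at every update. That loss of information is the kind of hard decision the abstract refers to.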


Consensus Clustering for Time Course Gene Expression Microarray Data

  • Kim, Seo-Young;Bae, Jong-Sung
    • Communications for Statistical Applications and Methods / v.12 no.2 / pp.335-348 / 2005
  • The rapid development of microarray technologies has enabled the expression levels of thousands of genes to be monitored simultaneously. Recently, time course gene expression data have often been measured to study dynamic biological systems and gene regulatory networks. With such data, biologists attempt to group genes based on the temporal pattern of their expression levels. We apply the consensus clustering algorithm to time course gene expression data in order to infer statistically meaningful information from the measurements. We evaluate consensus clustering and existing clustering methods with various validation measures. In this paper, we consider hierarchical clustering and Diana as existing methods, and consensus clustering with hierarchical clustering, with Diana, and with mixed hierarchical and Diana methods, and we evaluate their performance on a real microarray data set and two simulated data sets.
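
A common ingredient of consensus clustering is a co-association (consensus) matrix that records how often two items land in the same cluster across several clustering runs. The sketch below illustrates only that ingredient. The abstract does not describe the exact procedure used in the paper, and the function name and toy partitions are hypothetical.

```python
def coassociation(partitions):
    """Fraction of partitions in which each pair of items shares a cluster."""
    n = len(partitions[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    runs = len(partitions)
    return [[value / runs for value in row] for row in counts]


# Three hypothetical clusterings of four genes. Label values are arbitrary,
# so the first two partitions describe the same grouping.
partitions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
consensus = coassociation(partitions)
```

A final grouping is often obtained by clustering the items with 1 - consensus[i][j] as the dissimilarity, for example with a hierarchical method.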

Customer Load Pattern Analysis using Clustering Techniques (클러스터링 기법을 이용한 수용가별 전력 데이터 패턴 분석)

  • Ryu, Seunghyoung;Kim, Hongseok;Oh, Doeun;No, Jaekoo
    • KEPCO Journal on Electric Power and Energy / v.2 no.1 / pp.61-69 / 2016
  • Understanding load patterns and classifying customers are basic steps in analyzing the behavior of electricity consumers. To that end, many studies have addressed the clustering of customers' daily load data. Nowadays, the deployment of advanced metering infrastructure (AMI) and big-data technologies makes it easier to study customers' load data. In this paper, we study load clustering from the viewpoint of yearly and daily load patterns. We compare four clustering methods: K-means clustering, hierarchical clustering (average linkage and Ward's method), and DBSCAN (Density-Based Spatial Clustering of Applications with Noise). We also discuss the relationship between the clustering results and the Korean Standard Industrial Classification (KSIC), which is one possible set of labels for customers' load data. We find that hierarchical clustering with Ward's method is suitable for clustering load data, and that KSIC can be characterized well by the daily load pattern but not by the yearly load pattern.

Double monothetic clustering for histogram-valued data

  • Kim, Jaejik;Billard, L.
    • Communications for Statistical Applications and Methods / v.25 no.3 / pp.263-274 / 2018
  • A common issue in the analysis of large datasets is detecting and constructing homogeneous groups of objects. This is typically done by some form of clustering technique. In this study, we present a divisive hierarchical clustering method for two monothetic characteristics of histogram data. Unlike a classical data point, a histogram has its own internal variation as well as location information. However, to find the optimal bipartition, existing divisive monothetic clustering methods for histogram data consider only location information as a monothetic characteristic, so they cannot distinguish histograms with the same location but different internal variations. Thus, a divisive clustering method that considers both the location and the internal variation of histograms is proposed in this study. The method makes clustering outcomes easier to interpret by providing a binary question for each split. The proposed clustering method is verified through a simulation study and applied to a large U.S. house property value dataset.

Approximate Clustering on Data Streams Using Discrete Cosine Transform

  • Yu, Feng;Oyana, Damalie;Hou, Wen-Chi;Wainer, Michael
    • Journal of Information Processing Systems / v.6 no.1 / pp.67-78 / 2010
  • In this study, a clustering algorithm that uses DCT-transformed data is presented. The algorithm is a grid density-based clustering algorithm that can identify clusters of arbitrary shape. Streaming data are transformed and reconstructed as needed for clustering. Experimental results show that the DCT can approximate a data distribution efficiently using only a small number of coefficients while preserving the clusters well. The grid-based clustering algorithm works well with DCT-transformed data, demonstrating the viability of the DCT for data stream clustering applications.
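
The idea of approximating a distribution with a few DCT coefficients can be sketched in one dimension as follows. This is a minimal illustration under assumptions made here: an orthonormal DCT-II over hypothetical grid-cell counts, keeping the first 8 of 16 coefficients. It is not the streaming algorithm from the paper, and the function names are hypothetical.

```python
import math


def dct2(x):
    """Orthonormal DCT-II of a sequence of numbers."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct2(coeffs, n):
    """Inverse of dct2 for length n; coefficients not supplied count as zero."""
    out = []
    for i in range(n):
        s = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            s += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(s)
    return out


# Hypothetical point counts per grid cell, with two dense regions.
counts = [1, 2, 8, 9, 8, 2, 1, 0, 0, 1, 6, 7, 6, 1, 0, 0]

coeffs = dct2(counts)
full = idct2(coeffs, len(counts))        # all coefficients: exact up to rounding
approx = idct2(coeffs[:8], len(counts))  # low-frequency half only: an approximation
```

Because the transform is orthonormal, the squared reconstruction error of `approx` equals the energy of the discarded coefficients, which gives a direct way to choose how many coefficients to keep.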

Transactions Clustering based on Item Similarity (아이템의 유사도를 고려한 트랜잭션 클러스터링)

  • 이상욱;김재련
    • Proceedings of the Korea Intelligent Information System Society Conference / 2002.11a / pp.250-257 / 2002
  • Clustering is a data mining method that consists of discovering interesting data distributions in very large databases. In traditional data clustering, the similarity of a cluster of objects is measured by the pairwise similarity of the objects in that cluster. In view of the nature of clustering transactions, we devise in this paper a novel measurement called item similarity and use it to perform clustering. With this item similarity measurement, we develop an efficient clustering algorithm for target marketing in each group.


Descriptive and Systematic Comparison of Clustering Methods in Microarray Data Analysis

  • Kim, Seo-Young
    • The Korean Journal of Applied Statistics / v.22 no.1 / pp.89-106 / 2009
  • There have been many advances in the development of improved clustering methods for microarray data analysis, but traditional clustering methods are still often used in genomic data analysis, which may be due more to their conceptual simplicity and their broad availability in commercial software packages than to their intrinsic merits. Thus, it is crucial to assess the performance of each existing method through a comprehensive comparative analysis so as to provide informed guidelines on choosing clustering methods. In this study, we investigated existing clustering methods applied to microarray data in various real scenarios. To this end, we focused on how the various methods differ, and why a particular method does not perform well. We applied both internal and external validation methods to the following eight clustering methods using various simulated data sets and real microarray data sets.

EXTENDED ONLINE DIVISIVE AGGLOMERATIVE CLUSTERING

  • Musa, Ibrahim Musa Ishag;Lee, Dong-Gyu;Ryu, Keun-Ho
    • Proceedings of the KSRS Conference / 2008.10a / pp.406-409 / 2008
  • Clustering data streams is important in many applications, such as sensor networks. Existing hierarchical methods follow a semi-fuzzy clustering approach that yields duplicate clusters. To solve this problem, we propose an extended online divisive agglomerative clustering method for data streams. It builds a tree-like, top-down hierarchy of clusters that evolves with the data stream, using a geometric time frame for snapshots. It is an enhancement of Online Divisive Agglomerative Clustering (ODAC) with a pruning strategy to avoid duplicate clusters. Its main feature is that update time and memory usage are independent of the number of examples in the data stream. It can be used for clustering sensor data, for network monitoring, and for web click streams.


A Clustering Tool Using Particle Swarm Optimization for DNA Chip Data

  • Han, Xiaoyue;Lee, Min-Soo
    • Genomics & Informatics / v.9 no.2 / pp.89-91 / 2011
  • DNA chips are becoming increasingly popular as a convenient way to perform vast numbers of gene-related experiments on a single chip, and analyzing the data that such chips provide is becoming correspondingly important. A very important analysis of DNA chip data is clustering genes to identify gene groups that have similar properties, such as a relation to cancer. Clustering DNA chip data usually involves a large search space and has a very fuzzy characteristic. The Particle Swarm Optimization (PSO) algorithm, which was recently proposed, is a very good candidate for solving such problems. In this paper, we propose a clustering mechanism that is based on the Particle Swarm Optimization algorithm. Our experiments show that the PSO-based clustering algorithm we developed is efficient in terms of execution time for clustering DNA chip data, and can thus be used to extract valuable information, such as cancer-related genes, from DNA chip data with high cluster accuracy and in a timely manner.
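
The general idea of PSO-based clustering can be sketched as below: each particle encodes a full set of k centroids, and the swarm searches for the set with the lowest sum of squared distances. This is a generic textbook-style sketch, not the tool described in the paper. The fitness function, the parameter values (w, c1, c2, swarm size, iteration count), the function names, and the toy data are all assumptions made for illustration.

```python
import random


def sse(centroids, points):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(
        min(sum((p - c) ** 2 for p, c in zip(pt, cen)) for cen in centroids)
        for pt in points
    )


def pso_cluster(points, k, n_particles=10, n_iter=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for k centroids with a basic global-best particle swarm."""
    rng = random.Random(seed)
    dim = len(points[0])
    size = k * dim

    def decode(position):
        # A particle is a flat list holding k centroids of `dim` coordinates each.
        return [position[i * dim:(i + 1) * dim] for i in range(k)]

    # Initialise each particle from k distinct, randomly chosen data points.
    positions = [
        [coord for pt in rng.sample(points, k) for coord in pt]
        for _ in range(n_particles)
    ]
    velocities = [[0.0] * size for _ in range(n_particles)]
    pbest = [p[:] for p in positions]
    pbest_fit = [sse(decode(p), points) for p in positions]
    best = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[best][:], pbest_fit[best]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(size):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (
                    w * velocities[i][d]
                    + c1 * r1 * (pbest[i][d] - positions[i][d])
                    + c2 * r2 * (gbest[d] - positions[i][d])
                )
                positions[i][d] += velocities[i][d]
            fit = sse(decode(positions[i]), points)
            if fit < pbest_fit[i]:  # update the particle's personal best
                pbest[i], pbest_fit[i] = positions[i][:], fit
                if fit < gbest_fit:  # update the swarm's global best
                    gbest, gbest_fit = positions[i][:], fit
    return decode(gbest), gbest_fit


# Hypothetical two-dimensional expression profiles forming two groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, fitness = pso_cluster(points, k=2)
```

The fixed seed makes a run repeatable. Real gene expression data would have many more dimensions, and the parameters would need tuning.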