Search | Korea Science

A Performance Comparison of Cluster Validity Indices based on K-means Algorithm (K-means 알고리즘 기반 클러스터링 인덱스 비교 연구)

Shim, Yo-Sung;Chung, Ji-Won;Choi, In-Chan
- Asia pacific journal of information systems
- /
- v.16 no.1
- /
- pp.127-144
- /
- 2006
The K-means algorithm is widely used at the initial stage of data analysis in the data mining process, partly because of its low time complexity and its simple implementation. Cluster validity indices are used along with the algorithm to determine the number of clusters and to evaluate the clustering results for a dataset. In this paper, we present a performance comparison of sixteen indices, selected from forty indices in the literature for their applicability to nonhierarchical clustering algorithms. The datasets used in the experiment are generated from multivariate normal distributions. In particular, four error types are considered in the comparison: standardization, outlier generation, error perturbation, and noise dimension addition. The experiment also analyzes the effects of varying the number of points, attributes, and clusters on performance. The simulation results show that the Calinski and Harabasz index performs best across all datasets and that the Davies and Bouldin index becomes a strong competitor as the number of points in the dataset increases.
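As an illustration of how a validity index scores a partition, the following is a minimal sketch of the Calinski and Harabasz index in plain Python. It uses the standard textbook definition (between-cluster dispersion over within-cluster dispersion, each divided by its degrees of freedom) and is not taken from the paper; the function names `centroid`, `sq_dist`, and `calinski_harabasz` are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence

Point = Sequence[float]


def centroid(points: Sequence[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty collection of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def calinski_harabasz(points: Sequence[Point], labels: Sequence[Hashable]) -> float:
    """Calinski-Harabasz index: [B / (k - 1)] / [W / (n - k)].

    B is the between-cluster dispersion, W the within-cluster dispersion,
    n the number of points, and k the number of clusters.
    Larger values indicate compact, well-separated clusters.
    """
    clusters: Dict[Hashable, List[Point]] = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    n, k = len(points), len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the index needs 2 <= k < n")

    overall = centroid(points)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        c = centroid(members)
        between += len(members) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in members)

    if within == 0.0:
        return float("inf")  # every point coincides with its centroid
    return (between / (k - 1)) / (within / (n - k))
```

To choose the number of clusters with such an index, one would typically run K-means for several candidate values of k and keep the k with the highest score.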

On the clustering of huge categorical data

Kim, Dae-Hak
- Journal of the Korean Data and Information Science Society
- /
- v.21 no.6
- /
- pp.1353-1359
- /
- 2010
The basic objective of cluster analysis is to discover natural groupings of items. In general, clustering is based on a similarity (or dissimilarity) matrix or on the original input data, and various measures of similarity between objects have been developed. In this paper, we consider the clustering of a huge real categorical data set that describes the time, location, and activity patterns of Korean people. Several useful similarity measures for the categorical variables are developed and adopted for this data set. Hierarchical and nonhierarchical clustering methods are applied to the data set, which is huge and consists of many categorical variables.

Double K-Means Clustering (이중 K-평균 군집화)

허명회
- The Korean Journal of Applied Statistics
- /
- v.13 no.2
- /
- pp.343-352
- /
- 2000
In this study, the author proposes a nonhierarchical clustering method, called "Double K-Means Clustering", which clusters multivariate observations with the following algorithm. Step I: Carry out ordinary K-means clustering and obtain k temporary clusters with sizes $n_1, \ldots, n_k$, centroids $c_1, \ldots, c_k$, and pooled covariance matrix $S$. Step II-1: Allocate the observation $x_i$ to the cluster F if it satisfies ....., where $N$ is the total number of observations, for $i = 1, \ldots, N$. Step II-2: Update the cluster sizes $n_1, \ldots, n_k$, centroids $c_1, \ldots, c_k$, and pooled covariance matrix $S$. Step II-3: Repeat Steps II-1 and II-2 until the change becomes negligible. Double K-means clustering is nearly "optimal" under a mixture of k multivariate normal distributions with a common covariance matrix. It is also nearly affine invariant, with the data-analytic implication that variable standardization is largely unnecessary. The method is demonstrated numerically on Fisher's iris data.

Cluster-based Information Retrieval with Tolerance Rough Set Model

Ho, Tu-Bao;Kawasaki, Saori;Nguyen, Ngoc-Binh
- International Journal of Fuzzy Logic and Intelligent Systems
- /
- v.2 no.1
- /
- pp.26-32
- /
- 2002
The objectives of this paper are twofold. The first is to introduce a model for representing documents with semantic relatedness using rough sets, but with tolerance relations instead of equivalence relations (TRSM). The second is to introduce two document clustering algorithms based on this model, one hierarchical and one nonhierarchical, and TRSM cluster-based information retrieval using these two algorithms. The experimental results show that TRSM offers an alternative approach to text clustering and information retrieval.
https://doi.org/10.5391/IJFIS.2002.2.1.026

Clustering Technique for Multivariate Data Analysis

Lee, Jin-Ki
- Journal of the military operations research society of Korea
- /
- v.6 no.2
- /
- pp.89-127
- /
- 1980
This article examines the multivariate analysis techniques of cluster analysis. The theory and applications of the techniques and the computer software that implements them are discussed, and sample jobs are included. A hierarchical cluster analysis algorithm, available in the IMSL software package, is applied to a set of data extracted from a group of subjects in order to partition a collection of 26 attributes of a weapon system into six clusters of superattributes. A nonhierarchical clustering procedure was applied to a collection of tank data consisting of twenty-four observations of ten attributes. The cluster analysis shows that the tanks cluster somewhat naturally by nationality. Principal component analysis and discriminant analysis show that tank weight is the single most important discriminator among nationalities, although these analyses are not shown in this article because of space restrictions. This article is part of a master's thesis in operations research.

Automated K-Means Clustering and R Implementation (자동화 K-평균 군집방법 및 R 구현)

Kim, Sung-Soo
- The Korean Journal of Applied Statistics
- /
- v.22 no.4
- /
- pp.723-733
- /
- 2009
The crucial problems of K-means clustering are deciding the number of clusters and the initial cluster centroids. Hence, K-means clustering generally consists of a two-stage procedure. The first stage runs hierarchical clustering to obtain the number of clusters and the cluster centroids, and the second stage runs nonhierarchical K-means clustering using the results of the first stage. Here we provide an automated K-means clustering procedure that obtains the initial cluster centroids and can also be used for large data sets, together with a software program implemented in R.
https://doi.org/10.5351/KJAS.2009.22.4.723
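The two-stage idea described in the abstract above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's R implementation: the number of clusters `k` is supplied by the caller rather than derived from the hierarchy, stage 1 uses a naive centroid-linkage merge (cubic time, so small data only), and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def mean(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty collection of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hierarchical_centroids(points: Sequence[Point], k: int) -> List[Point]:
    """Stage 1 (assumed linkage): merge the two clusters with the closest
    centroids until k clusters remain, then return their centroids."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(mean(clusters[i]), mean(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [mean(c) for c in clusters]


def kmeans(points: Sequence[Point], centroids: Sequence[Point],
           max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Stage 2: Lloyd's K-means iterations started from the given centroids."""
    centroids = list(centroids)
    labels: List[int] = []
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Recompute centroids; an empty cluster keeps its previous centroid.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def two_stage_kmeans(points: Sequence[Point], k: int) -> Tuple[List[int], List[Point]]:
    """Seed K-means with centroids obtained from a hierarchical first stage."""
    return kmeans(points, hierarchical_centroids(points, k))
```

Seeding K-means from a hierarchical solution removes the dependence on random initial centroids, which is the motivation the abstract gives for the two-stage procedure.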

A Comparative Study on Statistical Clustering Methods and Kohonen Self-Organizing Maps for Highway Characteristic Classification of National Highway (일반국도 도로특성분류를 위한 통계적 군집분석과 Kohonen Self-Organizing Maps의 비교연구)

Cho, Jun Han;Kim, Seong Ho
- KSCE Journal of Civil and Environmental Engineering Research
- /
- v.29 no.3D
- /
- pp.347-356
- /
- 2009
This paper describes a clustering analysis for traffic-characteristics-based highway classification, intended as a departure from existing highway functional classification methodologies. The research focuses on comparing the performance of clustering techniques based on total within-group error and on deriving the optimal number of clusters. It analyzes statistical clustering methods (the hierarchical Ward's minimum-variance method and the nonhierarchical K-means method) and the Kohonen self-organizing maps clustering method for highway characteristic classification. The outcomes of the clustering techniques are compared in terms of the number of samples and the traffic characteristics of the subsets derived from the optimal number of clusters. Overall, the K-means method gives better results than the other methods for fewer than 12 clusters, and Kohonen self-organizing maps give the best results for more than 20 clusters. The main contribution of this research is expected to be the basic road attribute information produced by the highway characteristic classification.
https://doi.org/10.12652/Ksce.2009.29.3D.347

A Study of User Interests and Tag Classification related to resources in a Social Tagging System (소셜 태깅에서 관심사로 바라본 태그 특징 연구 - 소셜 북마킹 사이트 'del.icio.us'의 태그를 중심으로 -)

Bae, Joo-Hee;Lee, Kyung-Won
- Proceedings of the HCI Society of Korea Conference (한국HCI학회:학술대회논문집)
- /
- 2009.02a
- /
- pp.826-833
- /
- 2009
The rise of social tagging is shifting classification from taxonomy to folksonomy. Tags represent a new approach to organizing information. Nonhierarchical classification allows data to be gathered freely, allows easy access, and lets users move directly to other content topics. Tags are expected to play a key role in clustering various types of content and to expand into networks of common interests among users. This paper first determines the relationships among users, tags, and resources in a social tagging system and examines which aspects of a website's features users consider when creating a tag. To that end, the study uses tags from the social bookmarking service 'del.icio.us' to analyze the features of tag words chosen when a new web page is added to a list. Website features are classified into 7 items, referred to as tag classification related to resources. Experiments were conducted to test the proposed classification method in the areas of music, photography, and games. This paper attempts to investigate the perspective from which users apply a tag to a web page and to establish the potential for expanding a social service that offers the opportunity to create a new business model.
