• Title/Summary/Keyword: k-means clustering Algorithm

Search Result 545, Processing Time 0.028 seconds

The Evaluation Measure of Text Clustering for the Variable Number of Clusters (가변적 클러스터 개수에 대한 문서군집화 평가방법)

  • Jo, Tae-Ho
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2006.10b
    • /
    • pp.233-237
    • /
    • 2006
  • This study proposes an innovative measure for evaluating the performance of text clustering. In using K-means algorithm and Kohonen Networks for text clustering, the number clusters is fixed initially by configuring it as their parameter, while in using single pass algorithm for text clustering, the number of clusters is not predictable. Using labeled documents, the result of text clustering using K-means algorithm or Kohonen Network is able to be evaluated by setting the number of clusters as the number of the given target categories, mapping each cluster to a target category, and using the evaluation measures of text. But in using single pass algorithm, if the number of clusters is different from the number of target categories, such measures are useless for evaluating the result of text clustering. This study proposes an evaluation measure of text clustering based on intra-cluster similarity and inter-cluster similarity, what is called CI (Clustering Index) in this article.

  • PDF

An Improved Hybrid Canopy-Fuzzy C-Means Clustering Algorithm Based on MapReduce Model

  • Dai, Wei;Yu, Changjun;Jiang, Zilong
    • Journal of Computing Science and Engineering
    • /
    • v.10 no.1
    • /
    • pp.1-8
    • /
    • 2016
  • The fuzzy c-means (FCM) is a frequently utilized algorithm at present. Yet, the clustering quality and convergence rate of FCM are determined by the initial cluster centers, and so an improved FCM algorithm based on canopy cluster concept to quickly analyze the dataset has been proposed. Taking advantage of the canopy algorithm for its rapid acquisition of cluster centers, this algorithm regards the cluster results of canopy as the input. In this way, the convergence rate of the FCM algorithm is accelerated. Meanwhile, the MapReduce scheme of the proposed FCM algorithm is designed in a cloud environment. Experimental results demonstrate the hybrid canopy-FCM clustering algorithm processed by MapReduce be endowed with better clustering quality and higher operation speed.

Semantic-Based K-Means Clustering for Microblogs Exploiting Folksonomy

  • Heu, Jee-Uk
    • Journal of Information Processing Systems
    • /
    • v.14 no.6
    • /
    • pp.1438-1444
    • /
    • 2018
  • Recently, with the development of Internet technologies and propagation of smart devices, use of microblogs such as Facebook, Twitter, and Instagram has been rapidly increasing. Many users check for new information on microblogs because the content on their timelines is continually updating. Therefore, clustering algorithms are necessary to arrange the content of microblogs by grouping them for a user who wants to get the newest information. However, microblogs have word limits, and it has there is not enough information to analyze for content clustering. In this paper, we propose a semantic-based K-means clustering algorithm that not only measures the similarity between the data represented as a vector space model, but also measures the semantic similarity between the data by exploiting the TagCluster for clustering. Through the experimental results on the RepLab2013 Twitter dataset, we show the effectiveness of the semantic-based K-means clustering algorithm.

K-means based Clustering Method with a Fixed Number of Cluster Members

  • Yi, Faliu;Moon, Inkyu
    • Journal of Korea Multimedia Society
    • /
    • v.17 no.10
    • /
    • pp.1160-1170
    • /
    • 2014
  • Clustering methods are very useful in many fields such as data mining, classification, and object recognition. Both the supervised and unsupervised grouping approaches can classify a series of sample data with a predefined or automatically assigned cluster number. However, there is no constraint on the number of elements for each cluster. Numbers of cluster members for each cluster obtained from clustering schemes are usually random. Thus, some clusters possess a large number of elements whereas others only have a few members. In some areas such as logistics management, a fixed number of members are preferred for each cluster or logistic center. Consequently, it is necessary to design a clustering method that can automatically adjust the number of group elements. In this paper, a k-means based clustering method with a fixed number of cluster members is proposed. In the proposed method, first, the data samples are clustered using the k-means algorithm. Then, the number of group elements is adjusted by employing a greedy strategy. Experimental results demonstrate that the proposed clustering scheme can classify data samples efficiently for a fixed number of cluster members.

Data Clustering Method Using a Modified Gaussian Kernel Metric and Kernel PCA

  • Lee, Hansung;Yoo, Jang-Hee;Park, Daihee
    • ETRI Journal
    • /
    • v.36 no.3
    • /
    • pp.333-342
    • /
    • 2014
  • Most hyper-ellipsoidal clustering (HEC) approaches use the Mahalanobis distance as a distance metric. It has been proven that HEC, under this condition, cannot be realized since the cost function of partitional clustering is a constant. We demonstrate that HEC with a modified Gaussian kernel metric can be interpreted as a problem of finding condensed ellipsoidal clusters (with respect to the volumes and densities of the clusters) and propose a practical HEC algorithm that is able to efficiently handle clusters that are ellipsoidal in shape and that are of different size and density. We then try to refine the HEC algorithm by utilizing ellipsoids defined on the kernel feature space to deal with more complex-shaped clusters. The proposed methods lead to a significant improvement in the clustering results over K-means algorithm, fuzzy C-means algorithm, GMM-EM algorithm, and HEC algorithm based on minimum-volume ellipsoids using Mahalanobis distance.

Repeated Clustering to Improve the Discrimination of Typical Daily Load Profile

  • Kim, Young-Il;Ko, Jong-Min;Song, Jae-Ju;Choi, Hoon
    • Journal of Electrical Engineering and Technology
    • /
    • v.7 no.3
    • /
    • pp.281-287
    • /
    • 2012
  • The customer load profile clustering method is used to make the TDLP (Typical Daily Load Profile) to estimate the quarter hourly load profile of non-AMR (Automatic Meter Reading) customers. This study examines how the repeated clustering method improves the ability to discriminate among the TDLPs of each cluster. The k-means algorithm is a well-known clustering technology in data mining. Repeated clustering groups the cluster into sub-clusters with the k-means algorithm and chooses the sub-cluster that has the maximum average error and repeats clustering until the final cluster count is satisfied.

COUNTING OF FLOWERS BASED ON K-MEANS CLUSTERING AND WATERSHED SEGMENTATION

  • PAN ZHAO;BYEONG-CHUN SHIN
    • Journal of the Korean Society for Industrial and Applied Mathematics
    • /
    • v.27 no.2
    • /
    • pp.146-159
    • /
    • 2023
  • This paper proposes a hybrid algorithm combining K-means clustering and watershed algorithms for flower segmentation and counting. We use the K-means clustering algorithm to obtain the main colors in a complex background according to the cluster centers and then take a color space transformation to extract pixel values for the hue, saturation, and value of flower color. Next, we apply the threshold segmentation technique to segment flowers precisely and obtain the binary image of flowers. Based on this, we take the Euclidean distance transformation to obtain the distance map and apply it to find the local maxima of the connected components. Afterward, the proposed algorithm adaptively determines a minimum distance between each peak and apply it to label connected components using the watershed segmentation with eight-connectivity. On a dataset of 30 images, the test results reveal that the proposed method is more efficient and precise for the counting of overlapped flowers ignoring the degree of overlap, number of overlap, and relatively irregular shape.

Hybrid Simulated Annealing for Data Clustering (데이터 클러스터링을 위한 혼합 시뮬레이티드 어닐링)

  • Kim, Sung-Soo;Baek, Jun-Young;Kang, Beom-Soo
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.40 no.2
    • /
    • pp.92-98
    • /
    • 2017
  • Data clustering determines a group of patterns using similarity measure in a dataset and is one of the most important and difficult technique in data mining. Clustering can be formally considered as a particular kind of NP-hard grouping problem. K-means algorithm which is popular and efficient, is sensitive for initialization and has the possibility to be stuck in local optimum because of hill climbing clustering method. This method is also not computationally feasible in practice, especially for large datasets and large number of clusters. Therefore, we need a robust and efficient clustering algorithm to find the global optimum (not local optimum) especially when much data is collected from many IoT (Internet of Things) devices in these days. The objective of this paper is to propose new Hybrid Simulated Annealing (HSA) which is combined simulated annealing with K-means for non-hierarchical clustering of big data. Simulated annealing (SA) is useful for diversified search in large search space and K-means is useful for converged search in predetermined search space. Our proposed method can balance the intensification and diversification to find the global optimal solution in big data clustering. The performance of HSA is validated using Iris, Wine, Glass, and Vowel UCI machine learning repository datasets comparing to previous studies by experiment and analysis. Our proposed KSAK (K-means+SA+K-means) and SAK (SA+K-means) are better than KSA(K-means+SA), SA, and K-means in our simulations. Our method has significantly improved accuracy and efficiency to find the global optimal data clustering solution for complex, real time, and costly data mining process.

Analysis of COVID-19 Context-awareness based on Clustering Algorithm (클러스터링 알고리즘기반의 COVID-19 상황인식 분석)

  • Lee, Kangwhan
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.26 no.5
    • /
    • pp.755-762
    • /
    • 2022
  • This paper propose a clustered algorithm that possible more efficient COVID-19 disease learning prediction within clustering using context-aware attribute information. In typically, clustering of COVID-19 diseases provides to classify interrelationships within disease cluster information in the clustering process. The clustering data will be as a degrade factor if new or newly processing information during treated as contaminated factors in comparative interrelationships information. In this paper, we have shown the solving the problems and developed a clustering algorithm that can extracting disease correlation information in using K-means algorithm. According to their attributes from disease clusters using accumulated information and interrelationships clustering, the proposed algorithm analyzes the disease correlation clustering possible and centering points. The proposed algorithm showed improved adaptability to prediction accuracy of the classification management system in terms of learning as a group of multiple disease attribute information of COVID-19 through the applied simulation results.

K-means clustering using a center of gravity for grid-based sample (그리드 기반 표본의 무게중심을 이용한 케이-평균군집화)

  • Lee, Sun-Myung;Park, Hee-Chang
    • Journal of the Korean Data and Information Science Society
    • /
    • v.21 no.1
    • /
    • pp.121-128
    • /
    • 2010
  • K-means clustering is an iterative algorithm in which items are moved among sets of clusters until the desired set is reached. K-means clustering has been widely used in many applications, such as market research, pattern analysis or recognition, image processing, etc. It can identify dense and sparse regions among data attributes or object attributes. But k-means algorithm requires many hours to get k clusters that we want, because it is more primitive, explorative. In this paper we propose a new method of k-means clustering using a center of gravity for grid-based sample. It is more fast than any traditional clustering method and maintains its accuracy.