
Approximate k values using Repulsive Force without Domain Knowledge in k-means

  • Kim, Jung-Jae (Process and Engineering Research Lab. Control and Instrumentation Research Group, POSCO) ;
  • Ryu, Minwoo (Service Laboratory, Institute of Convergence Technology, KT R&D Center) ;
  • Cha, Si-Ho (Department of Multimedia Science, Chungwoon University)
  • Received : 2019.07.30
  • Accepted : 2020.01.14
  • Published : 2020.03.31

Abstract

The k-means algorithm is widely used in academia and industry due to its easy and simple implementation, enabling fast learning for complex datasets. However, k-means struggles to classify datasets without prior knowledge of specific domains. In a previous study, we proposed the repulsive k-means (RK-means) algorithm, which improves k-means by using a repulsive force concept that allows unnecessary cluster centroids to be deleted. Accordingly, RK-means can classify a dataset without domain knowledge. However, three main problems remain. The RK-means algorithm includes a cluster repulsive force offset for clusters confined within other clusters, which can cause cluster locking; we were unable to prove that RK-means provides optimal convergence in the previous study; and RK-means showed better performance only for particular normalization terms and weights. Therefore, this paper proposes the advanced RK-means (ARK-means) algorithm to resolve these problems. We establish an initialization strategy for deploying cluster centroids and define a metric for the ARK-means algorithm. Finally, we redefine the mass and normalization terms to better fit general datasets. We show ARK-means feasibility experimentally using the blob and iris datasets. The experimental results verify that the proposed ARK-means algorithm provides better performance than the k-means, k'-means, and RK-means algorithms.

Keywords

1. Introduction

The k-means algorithm is widely used in academia and industry due to its easy and simple implementation, enabling fast learning for complex datasets [1-3]. It has been employed in many research areas, including big data, machine learning, and data mining [4-6]. However, k-means converges to one of numerous local minima under iterative clustering and has a high dependency on the initial k cluster centers. The initial cluster centroids are usually selected randomly, but to ensure result quality, they should be selected using domain knowledge [7-9]. Unfortunately, practical datasets are often created from diverse domain areas, including smart cities, healthcare, and IoT, forming very large datasets. Thus, it is difficult to select appropriate cluster values that reflect domain knowledge. Therefore, we propose a method to derive an approximate k value without domain knowledge.

In a previous study, we proposed the repulsive k-means (RK-means) algorithm [10] to resolve this problem. We assumed that the cluster centroids exert repulsive forces on each other, where each cluster has a mass proportional to the sum of errors between its data and centroid. The key concept of RK-means was that clusters with larger mass represent a single data group better. Consequently, cluster centroids with larger mass would tend to maintain their current position, whereas cluster centroids with smaller mass would tend to search for positions where they could acquire larger mass. Thus, smaller mass cluster centroids were "pushed" by larger centroids, hence the concept of a repulsive force, until these smaller clusters became amalgamated within larger mass clusters or empty. Empty cluster centroids were then deleted. The RK-means algorithm iterated this process and could accurately classify a dataset without requiring specific domain knowledge.

However, although the RK-means algorithm resolved the k-means algorithm problem, three problems remained.

• The RK-means algorithm includes an offset to the cluster repulsive force where a cluster is confined within other clusters, which can cause cluster locking.

• We were unable to prove that RK-means provides optimal convergence in the previous study.

• RK-means showed better performance only for particular normalization terms and weights.

Therefore, we propose an advanced RK-means (ARK-means) algorithm to resolve these RK-means problems, establishing an initialization strategy for deploying cluster centroids to resolve cluster locking. We also prove ARK-means convergence to be locally optimal, and redefine the cluster mass and normalization terms to better fit general datasets.

We show the feasibility of the ARK-means algorithm experimentally, comparing it with the RK-means and k'-means algorithms using the blob and iris datasets. Experimental results verify that the proposed ARK-means algorithm provides more accurate performance than the other algorithms.

The rest of this paper is organized as follows. Section 2 discusses related work, including methods to find k without domain knowledge, and briefly introduces RK-means. Section 3 details the proposed ARK-means algorithm, and Section 4 presents the experimental results and subsequent performance evaluation. Section 5 summarizes and concludes the paper with remarks on future study directions.

2. Related Work

2.1 Selecting k without domain knowledge

The simplest approach to selecting an appropriate k without domain knowledge is to use heuristic methods, i.e., gradually increasing the initial cluster number. However, this approach rapidly becomes prohibitively expensive in computation and time as dataset size increases [11,12]. Consequently, many previous studies have investigated how clustering could be accomplished with high accuracy without domain knowledge, based on the finite mixture model [13-16].

Cheung et al. proposed rival penalized competitive learning (RPCL) based on competitive learning [17] to perform data clustering without knowing the exact number of clusters. They considered two centroids for each data point: the winner, the closest centroid, and the rival, the second closest centroid. The rival was then penalized using a delearning rate factor so that the appropriate centroid could be selected. Although RPCL could perform data clustering without knowing the initial cluster number, its performance was sensitive to the preselected rival delearning rate, and it could not guarantee optimal convergence.

Ma et al. [18] and Xie et al. [19] proposed rival penalized controlled competitive learning (RPCCL) and distance sensitive rival penalized competitive learning (DSRPCL), respectively, to improve the RPCL algorithm. RPCCL used a weight calculated from the distance between clusters to resolve the RPCL rival delearning rate sensitivity, and DSRPCL was able to guarantee optimal convergence using a cost function generalized from RPCL. These approaches achieved better performance than previous algorithms, and hence have been used in diverse research areas [20-23].

Žalik proposed the k'-means algorithm to improve the k-means algorithm [24]. Unlike the existing k-means algorithm, k'-means is composed of two main phases. In the first phase, the algorithm performs initial clustering and then executes preprocessing that assigns seed points to each cluster. Finally, the cost function of the assigned seed points is adjusted to a minimum. In this phase, the cluster with more data is selected as the winner, and clusters with fewer data become centroids of empty clusters, which are subsequently excluded from being winner candidates. Thus, k'-means selects the initial number of centroids without domain knowledge.

Arthur et al. provided the k-means++ algorithm to resolve the unstable clustering results of the existing k-means algorithm [25]. To this end, k-means++ adjusts the sampling probability distribution: in the initial point selection phase, a point that lies farther from the previously selected points is chosen with higher probability. Accordingly, the initial points are selected so that they are spread far apart from one another. However, the k-means++ algorithm cannot resolve the cluster locking problem mentioned in Section 2.2. Consequently, we need a strategy for selecting the initial points that resolves the cluster locking problem.

2.2 RK-means Algorithm

This section introduces our previously proposed repulsive k-means (RK-means) algorithm, which basically follows the classic k-means algorithm: (1) k initial centroids are randomly chosen from the dataset; (2) the data points closest to each centroid create a cluster; (3) each centroid is moved to the mean point of its cluster; and (4) steps (2) and (3) are repeated until the centroids stop moving. Unlike the k'-means algorithm, the RK-means algorithm automatically creates empty clusters and then deletes them, using a distance function built from the terms described below.
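For reference, the classic loop above can be written compactly in NumPy. This is a minimal illustrative sketch of steps (1)-(4) before the RK-means modifications; the function and parameter names are chosen for illustration only.

```python
import numpy as np

def classic_kmeans(X, k, max_iter=100, tol=1e-6, seed=1):
    """Minimal classic k-means: steps (1)-(4) above."""
    rng = np.random.RandomState(seed)
    # (1) choose k initial centroids randomly from the dataset
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # (2) assign each data point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3) move each centroid to the mean point of its cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # (4) repeat until the centroids stop moving
        if np.linalg.norm(new_centroids - centroids) < tol:
            return new_centroids, labels
        centroids = new_centroids
    return centroids, labels
```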

RK-means changes step (3) as follows. The dataset consists of N data points, each a d-dimensional vector x = {x1, x2, ..., xd}, which are divided into clusters. The algorithm uses the same membership function as k-means,

\(\gamma_{n k}=\begin{cases} 1 & \text {if } k=\underset{k^{\prime}}{\text{argmin}}\left(\mathbf{x}_{n}-\boldsymbol{\mu}_{k^{\prime}}\right)^{2} \\ 0 & \text {otherwise} \end{cases}\)       (1)

where μ is a d-dimensional vector that represents the current centroid position.

However, the centroid position is defined by the sum of mean positions of data points included in a cluster,

\(\boldsymbol{\mu}_{\mathrm{k}}=\frac{\sum_{n=1}^{N} \gamma_{n k} \mathbf{x}_{n}}{\sum_{n=1}^{N} \gamma_{n k}}+D_{k}\)       (2)

where the Dk vector includes distance and direction of the kth repulsive force from another centroid,

\(D_{k}=\sum_{k^{\prime} \neq k} D_{k k^{\prime}}=\sum_{k^{\prime} \neq k} C \frac{1}{\left\|\mu_{k}-\mu_{k^{\prime}}\right\|^{2}} \cdot \frac{1}{M_{k}} \cdot \frac{\mu_{k}-\mu_{k^{\prime}}}{\left\|\mu_{k}-\mu_{k^{\prime}}\right\|}\)       (3)

where C is a normalizing term; the rightmost term is the direction vector of the repulsive force from the other centroid; and \(M_{k}\) is the mass of the kth cluster,

\(M_{k}=\frac{J_{k}}{J_{k^{\prime}}}=\frac{\sum_{n=1}^{N} \gamma_{n k}\left\|\boldsymbol{\mu}_{k}-\mathbf{x}_{n}\right\|}{\sum_{n=1}^{N} \gamma_{n k^{\prime}}\left\|\boldsymbol{\mu}_{k^{\prime}} -\mathbf{x}_{n}\right\|}\)       (4)

where 𝐽 is the sum of errors for the cluster.

Since the RK-means cluster mass is a ratio relative to other clusters, we need to normalize the physical quantity. The sum of errors over all clusters, \(\sum_{k} J_{k}\), is an appropriate normalization, but we use its reciprocal because the distance moved by an object is inversely proportional to its mass, i.e.,

\(C=\frac{1}{\sum_{k} J_{k}}\)       (5)

Eq. (2) allows a centroid with the largest mass to completely capture a data group and Eq. (5) ensures the repulsive force is normalized as it changes in every step.
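To make Eqs. (2)-(5) concrete, the following sketch computes the repulsive displacement \(D_k\) for every centroid. It is an illustrative reading of the formulas above, with our own variable names, not the exact implementation of [10].

```python
import numpy as np

def repulsive_displacements(X, centroids, labels):
    """D_k of Eq. (3) for each centroid, using the pairwise mass M_k of
    Eq. (4) and the normalizing term C of Eq. (5)."""
    K = len(centroids)
    # J_k: sum of distances between each centroid and the data assigned to it
    J = np.array([np.linalg.norm(X[labels == k] - centroids[k], axis=1).sum()
                  for k in range(K)])
    C = 1.0 / J.sum()                                   # Eq. (5)
    D = np.zeros_like(centroids)
    for k in range(K):
        for kp in range(K):
            if kp == k or J[kp] == 0 or J[k] == 0:
                continue
            diff = centroids[k] - centroids[kp]
            dist = np.linalg.norm(diff)
            M_k = J[k] / J[kp]                          # Eq. (4): mass relative to cluster k'
            # Eq. (3): magnitude C / (dist^2 * M_k), direction away from centroid k'
            D[k] += C * (1.0 / dist**2) * (1.0 / M_k) * (diff / dist)
    return D

# The RK-means update (Eq. (2)) then sets each centroid to its cluster mean plus D_k.
```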

Although the RK-means algorithm improves k-means using the repulsive force, three issues remain, as discussed in Section 1. Therefore, we propose ARK-means, based on the existing RK-means algorithm, to address cluster locking, prove convergence, and redefine the cluster mass and normalization.

3. Advanced Repulsive Force k-means Algorithm

3.1 Motivation

This section discusses the motivation to apply the repulsive force concept to clustering problems.

Suppose we place several magnets on a table, as shown in Fig. 1, where the outer rectangle represents the table, circles represent magnet positions, circle diameter represents their mass, and the dotted circles represent repulsive force. We assume all magnets have the same characteristics aside from their mass, i.e., density and magnetic flux density, and the magnetic field strength is proportional to magnet size (mass).


Fig. 1. Magnet position changes due to mutual repulsive forces

Fig. 1-(b) shows the magnets moved to non-overlapping field positions after some time, pushed by their repulsive force (we do not consider inertia, gravitation or friction). In particular, the smallest mass magnet is pushed furthest from its initial position.

The RK-means algorithm applies this repulsive force concept to clustering. Each cluster is considered as a magnet in vector space, where the dataset is divided into several clusters. Consequently, cluster centroids are pushed by one another, and each centroid either creates a new cluster at its new position or is pushed out to another position. Finally, centroids with smaller mass amalgamate with the data group of a cluster with larger mass or become empty, and empty clusters are deleted. Thus, we can classify a dataset without an initial k derived from domain knowledge.

3.2 Initialization

Fig. 2 demonstrates the RK-means cluster locking problem. Since the repulsive force is represented as a vector, the number of centroids depends on their initial positions. RK-means selects initial centroid positions randomly. Hence, a cluster with small mass can be surrounded by larger mass clusters (Fig. 2), with the repulsive force from any larger cluster offset by the forces from the other clusters. Consequently, after several time steps, the smaller cluster remains locked in the vector space rather than becoming empty or converging to another data group. This creates a poor cluster and keeps its data points from being assigned to more appropriate clusters.


Fig. 2. RK-means cluster locking problem

Therefore, the proposed ARK-means algorithm randomly selects initial cluster values within a sufficiently small space (initial space), defined from the overall mean data position. Thus, no poor clusters can be created except for the special case where the data is perfectly symmetrically distributed from the center.

To obtain the initial space, we first obtain the overall mean data position,

\(\mathbf{x}_{\text {means }}=\frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_{i}\)       (6)

where \(\mathbf{x}_{i} = \{x_{1}, \ldots, x_{d}\}\) for N d-dimensional data points. Then the vector \(\mathbf{v} = \{v_{i}\}_{i=1}^{d}\), which moves a point from \(\mathbf{x}_{means}\) to a random position within the initial space boundary, is

\(v_{i}=\begin{cases} -\varepsilon \text { or } \varepsilon & i=I \\ \text {random value in }[-\varepsilon, \varepsilon] & \text {otherwise} \end{cases}\)       (7)

where \(I \in \{1, 2, \ldots, d\}\) is randomly selected, and \(\varepsilon\) is a sufficiently small positive real number (\(\varepsilon > 0\)). Thus, \(\mathbf{v}\) has value \(-\varepsilon\) or \(\varepsilon\) for the Ith component, with the remaining components being random real numbers between \(-\varepsilon\) and \(\varepsilon\).

Consequently, the ARK-means initial centroid position is

\(\boldsymbol{\mu}_{init}=\mathbf{x}_{means}+\{v_{i}\}_{i=1}^{d}\)       (8)
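A minimal NumPy sketch of this initialization strategy (Eqs. (6)-(8)) follows. The helper name, default ε, and random seed are illustrative assumptions, not values fixed by the algorithm itself.

```python
import numpy as np

def init_centroids(X, k, eps=1e-3, seed=1):
    """Place k initial centroids inside a small box of half-width eps
    around the overall data mean (Eqs. (6)-(8))."""
    rng = np.random.RandomState(seed)
    x_means = X.mean(axis=0)                        # Eq. (6)
    d = X.shape[1]
    centroids = np.empty((k, d))
    for j in range(k):
        v = rng.uniform(-eps, eps, size=d)          # random components in [-eps, eps]
        I = rng.randint(d)                          # randomly chosen dimension I
        v[I] = eps if rng.rand() < 0.5 else -eps    # forced to -eps or +eps (Eq. (7))
        centroids[j] = x_means + v                  # Eq. (8)
    return centroids
```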

3.3 Advanced Repulsive Force k-means Algorithm

We modify the RK-means algorithm structure to prove ARK-means convergence. RK-means modified the k-means centroid update (Section 2.2), whereas for ARK-means we modify the metric between clusters and data to

\(d\left(\mathbf{x}, \boldsymbol{\mu}_{k}\right)=\left\{\mathbf{x}-\left(\boldsymbol{\mu}_{k}+D_{k}\right)\right\}^{2}\)       (9)

where \(\mathbf{x}\) is a data vector and \(\boldsymbol{\mu}_{k}\) is a centroid vector. Thus, \(d(\mathbf{x}, \boldsymbol{\mu}_{k})\) is the square of the distance between the data \(\mathbf{x}\) and the position that \(\boldsymbol{\mu}_{k}\) is pushed to by the repulsive force from the other centroids.

To prove that ARK-means convergence is optimal, we substitute Eq. (9) into the cluster membership (Eq. (1)) and the cost function, and use the k-means centroid update rule (Section 2.2) rather than Eq. (2). Consequently, centroids are moved to positions that minimize the modified cost function, and hence ARK-means converges to a local optimum of the metric (Eq. (9)) at every step. However, this restructuring does not affect algorithm performance.
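Concretely, substituting Eq. (9) into the membership and cost function, the cost that ARK-means minimizes at each step becomes (written here in our notation for clarity):

\(\sum_{n=1}^{N} \sum_{k} \gamma_{n k}\, d\left(\mathbf{x}_{n}, \boldsymbol{\mu}_{k}\right)=\sum_{n=1}^{N} \sum_{k} \gamma_{n k}\left\{\mathbf{x}_{n}-\left(\boldsymbol{\mu}_{k}+D_{k}\right)\right\}^{2}\)

where \(\gamma_{nk}\) is now determined by the argmin over \(d(\mathbf{x}_{n}, \boldsymbol{\mu}_{k})\) rather than over the plain squared distance.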

We also modify the \(D_{k}\) elements (Eq. (3)), which can affect algorithm performance. We redefine \(M_{k}\) as the ratio, between the current and neighboring clusters, of each cluster's error sum divided by the square root of its mean error,

\(M_{k}=\frac{J_{k} \sqrt{\frac{J_{k^{\prime}}}{N_{k^{\prime}}}}}{J_{k^{\prime}} \sqrt{\frac{J_{k}}{N_{k}}}}\)       (10)

where \(N_{k}\) is the number of data points in the kth cluster.

Compared with Eq. (4), this mass becomes smaller when the cluster's own mean error is larger and the neighboring cluster's mean error is smaller. Therefore, the centroid of a cluster that is likely to include multiple data groups has reduced mass and can be moved a large distance by the repulsive force, so such a cluster can be disassembled. Centroids also prefer to move toward densely populated points. Therefore, we redefine C as

\(C=\sqrt[\alpha]{\frac{1}{\sum_{k} J_{k}}}\)       (11)

where \(\alpha\) is a positive integer, generally \(\alpha \in \{1, 2\}\); increasing \(\alpha\) prevents C from converging to 0 when the cluster error sum is very large.
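As a small illustration, the redefined mass and normalizing term can be computed as follows. This is a sketch with our own names: J is an array of per-cluster error sums and n the per-cluster data counts.

```python
import numpy as np

def mass_and_norm(J, n, k, kp, alpha=2):
    """Redefined mass M_k between clusters k and k' (Eq. (10)) and the
    normalizing term C (Eq. (11))."""
    # Eq. (10): error-sum ratio, each scaled by the square root of its mean error
    M_k = (J[k] * np.sqrt(J[kp] / n[kp])) / (J[kp] * np.sqrt(J[k] / n[k]))
    # Eq. (11): alpha-th root of the reciprocal of the total error sum
    C = (1.0 / J.sum()) ** (1.0 / alpha)
    return M_k, C
```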

3.4 Proposed ARK-means Algorithm

Table 1 shows the ARK-means algorithm.

Table 1. ARK-means algorithm


In the ARK-means algorithm, lines 2 and 3 describe the input values, and line 5 describes the output value. Line 7 initializes the return values and all of the variables used in the algorithm. Lines 8 and 9 apply the initial centroid deployment strategy. Lines 11 to 14 describe the process of assigning data to the centroid with the smallest metric (Eq. (9)). Line 15 updates 𝑑𝑎𝑡𝑎𝑘 and 𝐸𝑘 for a cluster whenever data is assigned to that cluster's centroid. Lines 16 to 19 show the deletion process for empty clusters; the kth centroid component includes a space to store data for the kth cluster, such as the number of data points and the error sum (line 16). Lines 20 and 21 show the process of obtaining the distance each cluster is pushed by the repulsive force from all of its neighboring clusters. Lines 22 and 23 show the process of moving centroids from their current locations to the points with the smallest cost function. Lines 24 to 26 describe the termination condition, used in line 6: the algorithm stops when the difference between the total error sums of squares of the previous and current iterations is less than a pre-selected, moderately small real number.
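Since the pseudocode of Table 1 is given as an image, the following Python sketch reconstructs the flow described above: initialization, assignment with the metric of Eq. (9), empty-cluster deletion, repulsive displacement with Eqs. (10)-(11), the k-means centroid update, and the error-difference termination test. It is a simplified sketch, not the exact implementation used in the experiments; all names and the small guard constants are illustrative.

```python
import numpy as np

def ark_means(X, k, eps=1e-3, alpha=2, max_iter=100, tol=1e-4, seed=1):
    """Simplified ARK-means sketch following the steps described above."""
    rng = np.random.RandomState(seed)
    N, d = X.shape
    # lines 8-9: initial centroid deployment strategy (Section 3.2)
    x_means = X.mean(axis=0)
    centroids = np.empty((k, d))
    for j in range(k):
        v = rng.uniform(-eps, eps, size=d)
        I = rng.randint(d)
        v[I] = eps if rng.rand() < 0.5 else -eps
        centroids[j] = x_means + v
    D = np.zeros_like(centroids)
    prev_error = np.inf
    for _ in range(max_iter):                       # line 6: main loop
        # lines 11-15: assign each point to the centroid with the smallest metric (Eq. (9))
        pushed = centroids + D
        dists = np.linalg.norm(X[:, None, :] - pushed[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        counts = np.bincount(labels, minlength=len(centroids))
        # lines 16-19: delete empty clusters and re-assign the data
        keep = counts > 0
        centroids, D = centroids[keep], D[keep]
        pushed = centroids + D
        dists = np.linalg.norm(X[:, None, :] - pushed[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        counts = np.bincount(labels, minlength=len(centroids))
        J = np.array([np.linalg.norm(X[labels == j] - centroids[j], axis=1).sum()
                      for j in range(len(centroids))])
        # lines 20-21: repulsive displacement with the redefined mass (Eq. (10)) and C (Eq. (11))
        C = (1.0 / max(J.sum(), 1e-12)) ** (1.0 / alpha)
        D = np.zeros_like(centroids)
        for a in range(len(centroids)):
            for b in range(len(centroids)):
                if a == b:
                    continue
                diff = centroids[a] - centroids[b]
                dist = max(np.linalg.norm(diff), 1e-12)
                mass = (J[a] * np.sqrt(J[b] / counts[b])) / \
                       max(J[b] * np.sqrt(J[a] / counts[a]), 1e-12)
                D[a] += C / (dist ** 2 * max(mass, 1e-12)) * (diff / dist)
        # lines 22-23: move each centroid to the mean of its cluster (k-means update rule)
        centroids = np.array([X[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
        # lines 24-26: stop when the total squared error changes by less than tol (relative)
        error = np.sum((X - centroids[labels]) ** 2)
        if abs(prev_error - error) <= tol * error:
            break
        prev_error = error
    return centroids, labels
```

Here the relative tolerance tol = 1e-4 corresponds to the 0.01% square error difference used as a termination condition in Section 4.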

4. Experimental Clustering Results and Analysis

This section compares the proposed ARK-means algorithm experimentally with the RK-means and k'-means algorithms using the blob and iris datasets provided by scikit-learn [26]. We focus on the known problems discussed in Section 1: cluster locking, proof of optimal convergence, and redefining the mass and normalization terms to better fit general datasets. Therefore, we performed two experiments as follows.

1. We used the blobs dataset to fix the 2-dimensional data positions and then compared ARK-means with RK-means performance to investigate the possibility of poor clusters for different initial k values.

2. We compared the ARK-means, RK-means, k-means, and k'-means algorithms for approximating k without applying domain knowledge, using the blobs and iris datasets with the following termination conditions:

a. maximum number of iterations: 100,

b. square error difference < 0.01%

We also used different parameters to select the initial centroids for each algorithm:

• k-means uses the k-means++ method to stochastically select the centroid,

  • k’-means uses random values, and

  • ARK-means uses a fixed value (ε = 0.0025).

The experiments were performed on a PC with an Intel Core i-5-35550, 3.30 GHz CPU and 4 GB DDR3 RAM. The ARK-means algorithm was developed in Python 3.5.2, and random values were generated by NumPy 1.11.1 with seed = 1.

4.1 Initialization Strategy Experiment

Given the number of data points N, the number of classes S, and a seed value as inputs, the blobs dataset generator produces N 2-dimensional data points that follow a normal distribution, divided into S circular data groups.
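For instance, such a dataset can be generated with scikit-learn's make_blobs. The call below uses the S = 4, N = 350 setting of Fig. 3, with random_state playing the role of the seed value; it is an illustrative usage, not the exact experiment script.

```python
from sklearn.datasets import make_blobs

# 350 two-dimensional points drawn from 4 normally distributed circular groups
X, y = make_blobs(n_samples=350, centers=4, n_features=2, random_state=1)
```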

Fig. 3 shows the possibility of poor clusters using the blob dataset with S = 4 classes, N = 350, and initial k = 5 to 100. The result shown is the average over 1000 repeats, and poor clusters were defined as clusters with ≤ 20 data points. The RK-means poor cluster possibility is strongly sensitive to increasing k, frequently producing smaller mass clusters encircled by larger mass neighbors. In contrast, the proposed ARK-means achieved significantly lower poor cluster possibilities, particularly for larger k. Thus, the proposed initial dispersion mechanism for ARK-means successfully avoids poor clustering. This is because the ARK-means algorithm disperses the initial centroid points according to the initialization strategy proposed in this paper.


Fig. 3. Possibility of poor clusters concerning the number of initial clusters (k) for the indicated algorithms

4.2 Evaluation of the ARK-means Algorithm

We compared ARK-means performance as follows:

1. accuracy compared with RK-means and k'-means using the blobs dataset with initial k = 5 and 10;

2. accuracy for various combinations of initial k and seed points, S, using the blobs dataset;

3. accuracy and time to complete with respect to initial k compared with RK-means, k-means, and k'-means using the iris dataset.

4.2.1 Performance Comparison 1

We used the blobs dataset with S = 3, initial k = 5 and 10, and N = 500, 600, …, 1500. Experiments were repeated 30 times, and the presented data are the mean over all 30 repeats.

Fig. 4 shows average accuracy with respect to N. All algorithms provided broadly constant average performance irrespective of N, which confirms that the algorithms all provided stable performance. However, k'-means exhibited the lowest performance due to the upper bound problem for k, as discussed in Section 2, and also exhibited stronger sensitivity to N. In contrast, RK-means and ARK-means changed significantly depending on the initial k, with RK-means exhibiting significantly lower performance than ARK-means because RK-means strongly depends on initial centroid positions.


Fig. 4. Algorithm accuracy concerning dataset size for different initial k

The ARK-means algorithm classified all data into three clusters regardless of N, even when the initial k > 3. Thus, the proposed ARK-means resolved both the upper bound and large dataset classification problems without domain knowledge.

4.2.2 Performance Comparison 2

This experiment particularly focused on the proposed ARK-means feasibility. Therefore, we used the blobs dataset with N = 100, S = 3 to 6, and initial k = 10, 12, 14, and 16. The experiments were repeated 30 times, and presented numbers are averaged over the repeats. Table 2 shows mean accuracies for three cases: K’ = S, |K’−S| = 1, and |K’−S| > 1, where K’ is the final k from ARK-means.

Table 2. Experimental results of ARK-means accuracy using the blobs dataset


4.2.3 Performance Comparison 3

We compared ARK-means, RK-means, k-means, and k'-means using the iris dataset, comprising 150 4-dimensional feature vectors with 3 classes, setting initial k = 1 to 30, as shown in Fig. 5.


Fig. 5. Algorithm accuracy and speed using the iris dataset

Fig. 5-(a) shows that k-means achieved the lowest accuracy, which also rapidly decreased for k > 3, since poor cluster occurrence increased for initial k larger than S. RK-means achieved approximately 50% accuracy, confirming the performance degradation caused by its random initialization. k'-means achieved approximately 60% accuracy, but this algorithm only found 2 classes, rather than the actual 3; thus, k'-means did not classify the data well. ARK-means achieved approximately 75% accuracy because it corrects the cluster locking problem even for large initial k. Thus, the proposed initialization strategy for ARK-means resolved cluster locking.

Fig. 5-(b) shows the algorithm time to complete with respect to the initial k. All algorithms show similar performance to k-means, although ARK-means is slightly improved since convergence to a local optimum is guaranteed and convergence speed is improved by the terms introduced in the ARK-means formulation.

5. Discussion

This paper explained several current classification problems and showed that the proposed ARK-means algorithm provided appropriate solutions. However, we must consider the remaining issues with the ARK-means algorithm.

Section 4.2 showed that ARK-means accuracy decreased when the number of classes in the dataset increased, and the final number of clusters was smaller than the actual number of classes in most cases. These problems are due to clusters overlapping in a limited space, a known characteristic of the blobs dataset. A common way to resolve this problem is to distribute the data in an unbounded space. However, the blobs dataset's circular distribution has high density near the cluster centers, which makes it difficult to sufficiently distinguish the clusters. Therefore, we need to consider how weights and normalization terms could be more precisely defined in the algorithm, and add mechanisms to distinguish clusters with overlapping edges and different densities. These problems are not unique to the proposed ARK-means algorithm and remain generally challenging tasks for improving classification performance.

Another problem is defining the exact dataset class for data groups that are linearly distributed in space. For ARK-means, the more linearly distributed data groups there are, the more difficult it is to find the correct dataset class: ARK-means struggles to cover such a data group, because data on the elongated sides lie a significantly larger distance from the cluster centroid, which reduces the total cluster mass. This leads to misclassifications since ARK-means defines the repulsive force based on distance. Therefore, we need to consider another criterion, such as similarity, to define the repulsive force and hence resolve this problem, which will be somewhat challenging.

ARK-means struggles to identify approximate classes for outliers in the dataset, although outliers do not significantly affect selecting the approximate k. Outliers may occur in different data groups or completely separate from all data groups. For the former case, ARK-means could deal with outliers as an integrated cluster, and hence this has little effect on the selecting approximate k; whereas for the latter case, ARK-means recognizes the outlier as a separate cluster, and we could find a suitable approximate k by removing the outlier using filtering based on the number of data points in a cluster. Hence, outliers in the dataset should not affect dataset clustering without domain knowledge, which is the primary ARK-means purpose.

Finally, although the proposed ARK-means algorithm provides an approximate k value without domain knowledge, it has weak points in practice. As shown in Fig. 4 and Fig. 5, jitter is generated by the random seed, because the blob data are recreated each time and all of the distances between data points change. The ARK-means algorithm also shows performance deviation according to the k value given as the initial value. Consequently, the ARK-means algorithm has a different performance deviation depending on the locations of the initial points. A common way to resolve this phenomenon is to define the initial k value based on several simulations of the input data in a specific service domain. However, this approach incurs high cost and time, and we cannot know in advance how the characteristics of the data, including distances and averages, will change. Accordingly, this also remains a challenging task.

6. Conclusion

This paper proposed the ARK-means algorithm to resolve three specific significant problems with the RK-means algorithm. We performed three sets of experiments to verify ARK-means feasibility, with ARK-means achieving the highest performance compared with the current k-means, k'-means, and RK-means algorithms. We also discussed remaining practical ARK-means issues and the challenging tasks required to resolve them. Therefore, we can conclude with confidence that the proposed ARK-means algorithm will improve classification performance for large datasets, selecting the initial k without domain knowledge. Although we improved classification performance with the proposed algorithm, challenging tasks remain to apply the method in practice. To this end, future research will include statistics-based selection of initial points to reduce performance deviation in practice.

References

  1. Y.-M. Cheung, "On rival penalization controlled competitive learning for clustering with automatic cluster number selection," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 11, pp. 1583-1588, 2005. https://doi.org/10.1109/TKDE.2005.184
  2. M. E. Celebi, H. A. Kingravi, Patricio A. vela, "A comparative study of efficient initialization methods for the k-means clustering algorithm," Expert Systems with Applications, vol 40, no. 1, pp. 200-210, 2013. https://doi.org/10.1016/j.eswa.2012.07.021
  3. T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, A. Y. Wu "An Efficient k-Means Clustering Algorithm: Analysis and Implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881-892, 2002. https://doi.org/10.1109/TPAMI.2002.1017616
  4. D. Arthur, S. Vassilvitskii. "k-means++: the advantages of careful seeding," Nikhil Bansal, Kirk Pruhs, and Clifford Stein, editors, SODA, SIAM, pp.1027-1035, 2007.
  5. C. Fang, W. Jin, J. Ma, "k'-Means algorithms for clustering analysis with frequency sensitive discrepancy metics," Pattern Recognition Letters, vol. 34, no. 5, pp. 580-586, Apr. 2013. https://doi.org/10.1016/j.patrec.2012.11.004
  6. H. Khatter, V. Aggarwal, A. K. Ahlawat, "Performance Analysis of the Competitive Learning Algorithms on Gaussian Data in Automatic Cluster Selection," in Proc. of IEEE International Conference on Computational Intelligence & Communication Technology (CICT), U.P., India, 12-13 Feb. 2016.
  7. K. Wagstaff, C. Cardie, S. Rogers, S. Schroedl, "Constrained K-means Clustering with Background Knowledge," in Proc. of the Eighteenth International Conference on Machine Learning, pp. 577-584, 2001.
  8. A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Letters, vol. 31, no. 8, pp. 651-666, June 2010. https://doi.org/10.1016/j.patrec.2009.09.011
  9. J. Baek, Y. Kim, B. Chung, C. Yim, "Linear Spectral Clustering with Contrast-limited Adaptive Histogram Equalization for Superpixel Segmentation," IEIE Transactions on Smart Processing and Computing, vol. 8, no. 4, pp. 255-264, August, 2019. https://doi.org/10.5573/IEIESPC.2019.8.4.255
  10. J. Lei, T. Jiang, K. Wu, H. Du, G. Zhu, Z. Wang, "Robust K-means algorithm with automatically splitting and merging clusters and its applications for surveillance data," Multimedia Tools and Applications, vol. 75, no. 19, pp. 12043-12059, Oct. 2016. https://doi.org/10.1007/s11042-016-3322-5
  11. D. T. Pham, S. S. Dimov, C. D. Nguyen, "Selection of K in K-means clustering," in Proc. of ImechE, vol. 2019 Part C: Journal of Mechanical Engineering Science, vol. 219, pp. 103-119, 2005.
  12. T. M. Kodinariya, P. R. Makwana, "Review on determining number of Cluster in K-Means Clustering," International Journal of Advance Research in Computer Science and Management Studies, vol. 1, no. 6, pp. 90-95, 2013.
  13. P. S. Bradley, U. M. Fayyad, "Refining Initial Points for K-Means Clustering," in Proc. of the 15th International Conference on Machine Learning (ICML98), San Francisco, USA, pp. 91-99, 1998.
  14. D. Pelleg, A. Moore, "X-means: Extending K-means with Efficient Estimation of the Number of Clusters," in Proc. of the 17th International Conference on Machine Learning (ICML2000), pp. 727-734, 2000.
  15. M. K. Ng, J. Z. Huang, L. Jing, "An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data," IEEE Transactions on Knowledge & Data Engineering, vol. 19, pp. 1026-1041, 2007. https://doi.org/10.1109/TKDE.2007.1048
  16. J. Luo, J. Ma, "Image segmentation with the competitive learning-based MS model," in Proc. of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec, Canada, 27-30 Sep. 2015.
  17. Y.-M. Cheung, "Rival penalization controlled competitive learning for data clustering with unknown cluster number," in Proc. of the 9th International Conference on Neural Information Process (ICONIP '02), Singapore, 18-22 Nov. 2002.
  18. J. Ma, T. Wang, "A cost-function approach to rival penalized competitive learning (RPCL)," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 36, no. 4, pp. 722-737, 2006. https://doi.org/10.1109/TSMCB.2006.870633
  19. H. Xie, X. Luo, C. Wang, S. Liu, X. Xu, X. Tong, "Multispectral remote sensing image segmentation using rival penalized controlled competitive learning and fuzzy entropy," Soft Computing, vol. 20, no. 12, pp. 4709-4722, Dec. 2016. https://doi.org/10.1007/s00500-015-1601-0
  20. J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, California, USA, 16 Jan. 2008.
  21. J. Qin, W. Fu, H. Gao, W. X. Zheng, "Distributed k-means algorithm and fuzzy c-means algorithm for sensor networks based on multiagent consensus theory," IEEE Transactions on Cybernetics, vol. 47, no. 3, pp. 772-783, Mar. 2016. https://doi.org/10.1109/TCYB.2016.2526683
  22. L. Xu, A, Krzyzak, E. Oja, "Rival penalized competitive learning for clustering analysis, RBF net, and curve detection," IEEE Transactions on Neural networks, vol. 4, no. 4, pp. 636-649, 1993. https://doi.org/10.1109/72.238318
  23. G.-J. Yu, K.-Y. Yeh, "A K-Means Based Small Cell Deployment Algorithm for Wireless Access Networks," in Proc. of 2016 International Conference on Networking and Network Applications (NaNA), Hokkaido, Japan, 23-25 July 2016.
  24. K. R. Zalik, "An efficient k'-means clustering algorithm," Pattern Recognition Letters, vol. 29, no. 9, pp. 1385-1391, 2008. https://doi.org/10.1016/j.patrec.2008.02.014
  25. D. Arthur, S. Vassilvitskii, "k-means++: the advantages of careful seeding," in Proc. of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp. 1027-1035, 2007.
  26. F. Pedregosa et al., "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.