K-Means Clustering

Anjali shukla
2 min read · Oct 28, 2020


Definition:

K-means clustering partitions n objects into k clusters such that each object belongs to the cluster with the nearest mean. The method produces exactly k clusters that are as distinct from one another as possible.

K-means clustering is an unsupervised machine learning technique used to group a large dataset into smaller, simpler groups. It finds distinct patterns in the data and places similar data points in the same group. The variable k represents the number of groups.

Steps to find K-means clustering:

1) Choose k initial mean values (centroids) at random.

2) Assign each data point to the cluster with the nearest mean.

3) Recompute each cluster's mean, then repeat steps 2 and 3 until the means no longer change.

4) Stop.
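The steps above can be sketched as a minimal 1-D implementation (a sketch rather than production code; it assumes no cluster ever becomes empty during an iteration):

```python
import numpy as np

def kmeans_1d(data, k, seed=0):
    """Minimal 1-D k-means following the four steps above."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Step 1: pick k distinct data points as the initial means.
    means = rng.choice(data, size=k, replace=False)
    while True:
        # Step 2: assign each point to the cluster with the nearest mean.
        labels = np.argmin(np.abs(data[:, None] - means[None, :]), axis=1)
        # Step 3: recompute each cluster's mean.
        new_means = np.array([data[labels == j].mean() for j in range(k)])
        # Step 4: stop once the means no longer change.
        if np.allclose(new_means, means):
            return labels, new_means
        means = new_means
```

Running it on the example data below reproduces the final means M1 = 7 and M2 = 25.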

Example:

Data = {2, 3, 4, 10, 11, 12, 20, 25, 30}

k = 2

Start with an arbitrary split: K1= {2,3,4} K2= {10,11,12,20,25,30}

M1=3 M2=18

K1= {2,3,4,10} K2= {11,12,20,25,30}

M1=4.75≈5 M2=19.6≈20

K1= {2,3,4,10,11,12} K2= {20,25,30}

M1=7 M2=25

K1= {2,3,4,10,11,12} K2= {20,25,30}

M1=7 M2=25

The means did not change between the last two iterations, so we stop.

i.e. K1= {2,3,4,10,11,12}

K2= {20,25,30}
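For reference, the same grouping can be reproduced with scikit-learn's KMeans (assuming scikit-learn is installed; the parameter names below are from its public API):

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([2, 3, 4, 10, 11, 12, 20, 25, 30], dtype=float).reshape(-1, 1)

# n_init restarts the algorithm several times and keeps the best result,
# which mitigates the sensitivity to the random initial means.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

centers = sorted(c[0] for c in km.cluster_centers_)
print(centers)  # the two cluster means, matching M1 = 7 and M2 = 25 above
```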

Figure 1.1 — Cluster Analysis

Advantages and disadvantages of k-means clustering:

Advantages:

1) Simple: It is easy to implement k-means and identify unknown groups of data from complex data sets.

2) Flexible: The algorithm adapts easily to changes in the data; if problems arise, adjusting the cluster segments is straightforward.

3) Time complexity: Each iteration of k-means is linear in the number of data objects, so execution time grows only modestly as the dataset grows.

4) Easy to interpret: The results are easy to interpret; each cluster is summarized by its mean, which makes the structure of the data easy to understand.

5) Accuracy: When information about the problem domain is available, the k-means algorithm can be modified to use it, which improves the accuracy of the clusters.

Disadvantages:

1) Lacks consistency: K-means clustering can give different results on different runs of the algorithm, because the random choice of initial cluster centers can lead to different final clusterings.

2) Sensitivity to scale: Rescaling the dataset, whether through normalization or standardization, can completely change the final results.
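A small illustration of why scale matters (the income/age features and the centroid values here are made up for the example): when one feature spans a much larger range, it dominates the Euclidean distance, and rescaling it can flip which centroid is nearest.

```python
import numpy as np

point = np.array([30_000.0, 60.0])   # hypothetical [income in $, age]
c1 = np.array([21_000.0, 20.0])      # hypothetical centroid 1
c2 = np.array([40_000.0, 90.0])      # hypothetical centroid 2

def nearest(p, centroids):
    # index of the centroid closest to p (Euclidean distance)
    dists = [np.linalg.norm(p - c) for c in centroids]
    return int(np.argmin(dists))

raw = nearest(point, [c1, c2])                   # income dominates the distance
scale = np.array([1 / 1000, 1.0])                # rescale income to thousands
scaled = nearest(point * scale, [c1 * scale, c2 * scale])
print(raw, scaled)  # the nearest centroid flips after rescaling
```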

3) Heavy on large datasets: With a very large dataset, the computational load and RAM requirements of clustering can become prohibitive (building a dendrogram with a hierarchical technique is even more expensive and can overwhelm the machine).

4) Numerical data only: The k-means algorithm can be applied to numerical data only; categorical data must first be encoded numerically.

5) Prediction issues: It is difficult to predict a good value of k (the number of clusters) in advance, and it is also difficult to compare the quality of the clusters produced.
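One common heuristic for the "how many clusters?" problem is the elbow method: run k-means for several values of k and look for the point where the within-cluster sum of squares (scikit-learn exposes it as inertia_) stops dropping sharply. A sketch, assuming scikit-learn is installed, using the example data from above:

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([2, 3, 4, 10, 11, 12, 20, 25, 30], dtype=float).reshape(-1, 1)

inertias = []
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

for k, wcss in zip(range(1, 6), inertias):
    print(k, round(wcss, 2))
# choose the k at the "elbow", where the curve flattens out
```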
