K-Means Clustering
Definition:
K-means clustering aims to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. The method produces exactly k clusters that keep the points within each cluster as close as possible to that cluster's mean (it tries to minimize the within-cluster sum of squared distances).
K-means is an unsupervised machine learning technique used to summarize a large dataset as a small number of groups: similar data points are grouped together, and distinct patterns emerge across the groups. The variable k represents the number of groups in the data.
Steps to find a k-means clustering (a short code sketch follows these steps):
1) Choose k initial mean values, for example at random.
2) Assign each point to the cluster whose mean is nearest.
3) Recompute each cluster's mean and repeat step 2 until the means no longer change.
4) Stop.
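To make the steps concrete, the following short Python sketch (using NumPy) implements them for one-dimensional data like the example below. The function name kmeans_1d, the random choice of initial means, and the fixed iteration cap are illustrative assumptions, and the sketch does not handle the case of a cluster becoming empty.

import numpy as np

def kmeans_1d(points, k, max_iter=100, seed=0):
    # Illustrative sketch of the steps above for 1-D data.
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: take k of the points at random as the initial means.
    means = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the nearest mean.
        labels = np.argmin(np.abs(points[:, None] - means[None, :]), axis=1)
        # Step 3: recompute each mean; if the means did not change, stop.
        new_means = np.array([points[labels == j].mean() for j in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels

means, labels = kmeans_1d([2, 3, 4, 10, 11, 12, 20, 25, 30], k=2)
print(means, labels)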
Example:
Data = {2, 3, 4, 10, 11, 12, 20, 25, 30}, number of clusters k = 2
Initial clusters: K1 = {2, 3, 4}, K2 = {10, 11, 12, 20, 25, 30}
Means: M1 = 3, M2 = 18
Reassigning each point to the nearest mean and recomputing the means:
K1 = {2, 3, 4, 10} K2 = {11, 12, 20, 25, 30}
M1 = 4.75 ≈ 5, M2 = 19.6 ≈ 20
K1 = {2, 3, 4, 10, 11, 12} K2 = {20, 25, 30}
M1 = 7, M2 = 25
K1 = {2, 3, 4, 10, 11, 12} K2 = {20, 25, 30}
M1 = 7, M2 = 25
Since the means no longer change, we stop.
Final clusters: K1 = {2, 3, 4, 10, 11, 12}
K2 = {20, 25, 30}
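The same grouping can be reproduced with a library implementation. The sketch below assumes scikit-learn is available; the cluster numbering (which group is labelled 0 or 1) can differ between runs, but the grouping and the means should match the worked example.

import numpy as np
from sklearn.cluster import KMeans

data = np.array([2, 3, 4, 10, 11, 12, 20, 25, 30], dtype=float).reshape(-1, 1)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
for j in range(2):
    members = data[km.labels_ == j].ravel()
    print("cluster", j, members, "mean", km.cluster_centers_[j, 0])
# Expected grouping: {2, 3, 4, 10, 11, 12} with mean 7 and {20, 25, 30} with mean 25.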
Advantages and disadvantages of k-means clustering:
Advantages:
1) Simple: It is easy to implement k-means and identify unknown groups of data from complex data sets.
2) Flexible: The algorithm adapts easily to changes in the data; if a segmentation proves unsatisfactory, the clusters can be adjusted and the algorithm rerun with little effort.
3) Efficient: The running time of k-means is linear in the number of data objects, so it scales to large datasets better than many other clustering methods.
4) Easy to interpret: The results are easy to interpret, since each cluster is summarized by its mean (centroid), giving a compact description of the data.
5) Accuracy: When information about the problem domain is available, the algorithm can be modified to take it into account, which improves the accuracy of the resulting clusters.
Disadvantages:
1) Lacks consistency: K-means clustering can give different results on different runs of the algorithm, because the initial cluster centres are chosen at random.
2) Sensitivity to scale: Rescaling the dataset, whether through normalization or standardization, can change the final results significantly.
3) Heavy load on large datasets: When dealing with a very large dataset, constructing a dendrogram (as in hierarchical clustering) to compare or validate the grouping can exceed the computer's RAM and computational capacity.
4) Handles numerical data only: The k-means algorithm can be applied only to numerical data, because it relies on computing means and distances.
5) Prediction issues: It is difficult to predict k, the number of clusters, in advance, and it is also difficult to compare the quality of the clusters produced (see the elbow-method sketch after this list).
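One common, if informal, way to deal with the k-selection problem from point 5 is the elbow method: run k-means for several values of k and look at where the within-cluster sum of squares (inertia) stops dropping sharply. A minimal sketch, again assuming scikit-learn and reusing the example data:

import numpy as np
from sklearn.cluster import KMeans

data = np.array([2, 3, 4, 10, 11, 12, 20, 25, 30], dtype=float).reshape(-1, 1)
# Inertia = total within-cluster sum of squared distances to the nearest mean.
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print("k =", k, "inertia =", round(km.inertia_, 1))
# The value of k where the inertia curve flattens (the "elbow") is a reasonable choice.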