K-means clustering is arguably the most widely used unsupervised learning algorithm, thanks to its power and simplicity. But how does K-means clustering actually work?
The short answer is that K-means clustering works by creating a reference point (a centroid) for a desired number of classes, and then assigning data points to class clusters based on which reference point is closest. While that’s a quick definition for K-means clustering, let’s take some time to dive deeper into K-means clustering and get a better intuition for how it operates.
Before we examine the exact algorithms used to carry out K-means clustering, let’s take a little time to define clustering in general.
Clustering algorithms group items together based on some metric of similarity, often (though not exclusively) by finding the “centroid” of the possible groups in the dataset. There is a wide variety of clustering algorithms, but they all share the same goal: determining the groups intrinsic to a dataset. In doing so, a clustering algorithm tries to:
- Ensure all data points within a cluster are as similar to each other as possible.
- Ensure data points in different clusters are as dissimilar from each other as possible.
K-means clustering is one of the oldest and most commonly used clustering algorithms, and it operates based on vector quantization: a point in space is picked as an origin, and vectors are drawn from the origin to all the data points in the dataset.
In general, K-means clustering can be broken down into five different steps:
- Place all instances into subsets, where the number of subsets is equal to K.
- Find the mean point/centroid of the newly created cluster partitions.
- Calculate the distance from every point to each of the centroids.
- Assign each point to the cluster whose centroid is closest.
- After the points have been assigned to clusters, recompute the centroid of each cluster.
The assignment and update steps are repeated until the centroids stop moving (or a maximum number of iterations is reached).
Alternatively, after the centroids have been placed, we can conceive of K-means clustering as swapping back and forth between two different phases: labeling data points and updating centroids.
In the data point labeling phase, every data point is assigned a label that places it in the cluster belonging to the nearest centroid. The nearest centroid is typically determined using squared Euclidean distance, although other metrics such as Manhattan, cosine, or Jaccard distance can be used depending on the type of data being fed into the clustering algorithm.
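As a quick illustration, here is a minimal sketch of a few of these distance metrics computed with NumPy (the function names are our own, chosen for readability):

```python
import numpy as np

def squared_euclidean(a, b):
    # Sum of squared coordinate differences.
    return float(((a - b) ** 2).sum())

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return float(np.abs(a - b).sum())

def cosine_distance(a, b):
    # 1 minus cosine similarity; assumes neither vector is all zeros.
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

point = np.array([1.0, 2.0])
centroid = np.array([4.0, 6.0])
print(squared_euclidean(point, centroid))  # 25.0
print(manhattan(point, centroid))          # 7.0
```

Which metric is appropriate depends on the data: squared Euclidean is the standard choice for continuous features, while cosine distance is common for high-dimensional sparse data such as text.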
In the centroid update step, each centroid is recalculated as the mean position of all the data points currently assigned to its cluster.
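The two phases above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation; in particular, it initializes the centroids with the first K data points, which is a simplification (random initialization is more common in practice):

```python
import numpy as np

def kmeans(X, k, n_iters=100):
    # Simplified initialization: use the first k points as centroids.
    centroids = X[:k].astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Labeling phase: assign each point to the nearest centroid,
        # using squared Euclidean distance.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update phase: each centroid becomes the mean position of the
        # points currently assigned to its cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs; K-means should recover one centroid per blob.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
centroids, labels = kmeans(X, k=2)
```

On this toy dataset the algorithm converges in a few iterations, with each centroid landing on the mean of one blob.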
How to Choose the Right Value for “K”
Considering that K-means clustering is an unsupervised algorithm and the number of classes isn’t known in advance, how do you decide on the appropriate number of classes/the right value for K?
One technique for selecting the right K-value is called the “elbow technique”. The elbow technique consists of running a K-means clustering algorithm for a range of different K-values and using an error metric, typically the Sum of Squared Error (SSE), to determine which values of K give the best results. The SSE is calculated by summing the squared distances between each data point and the centroid of its cluster.
The term “elbow technique” comes from the fact that when you plot the SSE with regard to the different values of K, the resulting line plot will often have an “elbow” shape, where SSE decreases rapidly for the first few values of K, but then levels off. In such conditions, the value of K located at the elbow is the best value for K, as there are rapidly diminishing returns after this value.
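To make this concrete, here is a sketch of the procedure (using a compact K-means fit of our own with first-K-points initialization, a simplification): we run the fit for a range of K values on a toy dataset with two obvious clusters and record the SSE for each. Plotting `sse_per_k` against K would show the elbow at K = 2:

```python
import numpy as np

def kmeans_fit(X, k, n_iters=50):
    # Compact K-means fit; initializes with the first k points.
    centroids = X[:k].astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def sse(X, centroids, labels):
    # Sum of squared distances from each point to its assigned centroid.
    return float(((X - centroids[labels]) ** 2).sum())

# Two clear clusters: SSE drops sharply from K=1 to K=2, then levels off,
# so the elbow sits at K=2.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
sse_per_k = {k: sse(X, *kmeans_fit(X, k)) for k in range(1, 5)}
```

The drop in SSE from K = 1 to K = 2 is orders of magnitude larger than any later drop, which is exactly the diminishing-returns pattern the elbow technique looks for.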
Mini-Batch K-Means Clustering
As datasets grow larger, computation time grows as well. Basic K-means clustering can take a long time to complete when run on massive datasets, and as a result, tweaks to K-means clustering have been made to reduce the algorithm’s space and time costs.
Mini-Batch K-means clustering is a variant on K-means clustering where the size of the dataset being considered is capped. Normal K-means clustering operates on the entire dataset/batch at once, while Mini-batch K-means clustering breaks the dataset down into subsets. Mini-batches are randomly sampled from the entire dataset and for each new iteration a new random sample is selected and utilized to update the position of the centroids.
In Mini-Batch K-means clustering, clusters are updated using a combination of the mini-batch values and a learning rate. The learning rate for each cluster is the inverse of the number of data points assigned to that cluster so far, so it decreases over the iterations. Shrinking the learning rate reduces the impact of new data, and convergence is declared when several consecutive iterations produce no change in the clusters.
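The update rule can be sketched as follows. This is a minimal illustration of the commonly cited mini-batch formulation, not a reference implementation; seeding the centroids with one point per blob in the demo is our own simplification to keep the example deterministic:

```python
import numpy as np

def minibatch_kmeans(X, init_centroids, batch_size=3, n_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    centroids = init_centroids.astype(float).copy()
    counts = np.zeros(len(centroids))  # points assigned to each cluster so far
    for _ in range(n_iters):
        # Sample a random mini-batch instead of using the full dataset.
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Label the mini-batch against the current centroids.
        d = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for x, c in zip(batch, labels):
            counts[c] += 1
            lr = 1.0 / counts[c]  # learning rate: inverse of cluster size
            # Move the centroid toward the point with a shrinking step size.
            centroids[c] = (1.0 - lr) * centroids[c] + lr * x
    return centroids

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
# Seed one centroid per blob so the demo is deterministic.
centroids = minibatch_kmeans(X, init_centroids=X[[0, 3]])
```

Because the step size is `1 / count`, each centroid ends up at the running mean of all the points ever assigned to it, so early samples and late samples contribute equally while the per-update movement shrinks toward zero.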
Results of studies on the effectiveness of Mini-batch K-means clustering suggest that it can successfully reduce computation time with a slight trade-off in cluster quality.
Applications of K-Means Clustering
K-means clustering can safely be used in any situation where data points can be segmented into distinct groups/classes. Here are some examples of common use cases for K-means clustering.
K-means clustering could be applied to document classification, grouping documents by features like topic, tags, word usage, and metadata. It could also be used to classify users as bots or not bots based on patterns of activity like posts and comments. In healthcare, K-means clustering can group patients by level of concern when monitoring their health, using features like comorbidities, age, and patient history.
K-means clustering can also be used for more open-ended tasks like creating recommendation systems. Users of a system like Netflix can be grouped together based on viewing patterns and recommended similar content. K-means clustering could be used for anomaly detection tasks, highlighting potential instances of fraud or defective items.