K-means clustering is one of the oldest and most popular clustering methods. As an unsupervised machine learning algorithm, it’s perfect for data mining, or finding patterns in large amounts of unlabelled data. Lucky for us, k-means is easy to use and relatively accurate.
Read on to learn more about k-means clustering and why it’s a perfect unsupervised machine learning algorithm for beginners.
What is k-means clustering?
K-means clustering is an unsupervised machine learning algorithm that creates clusters within your data, which can help you to discover categories or groups that you might not have seen on your own.
Let’s start with the basics. As the name suggests, k-means clustering is a type of clustering algorithm.
These types of algorithm are used to discover inherent groupings (or clusters) within a dataset and can be very powerful tools. Searching for patterns in large data sets is also known as data mining.
Clustering algorithms are examples of unsupervised learning where you have input data (x) but no output variables – you don’t necessarily know ahead of time what the algorithm will discover in your data. There’s no right or wrong answer when it comes to unsupervised learning, the algorithm is left to discover any interesting structures that are already present within the data.
For instance, you might use a clustering algorithm to:
- Identify handwritten numbers
- Segment customers
- Classify documents
- Categorize inventory
- Detect anomolies
- Compress colors in images
In contrast, a supervised machine learning algorithm, such as linear regression, has clearly defined input and output variables, such as the size of a house (x) and its market value (y).
K-means is a simple, fast, and efficient algorithm that searches for a predetermined number of clusters within an unlabeled set of data. It’s based on the assumption that each cluster has a center, or ‘centroid’, and that each point within the dataset is closer to its own cluster center than it is to other cluster centers.
How k-means clustering works
To understand how the k-means algorithm works let’s look at a two-dimensional scatterplot of some random data.
Within this dataset we have two variables plotted on the x and y axis. We haven’t labeled either of the axes because this is an unsupervised machine learning algorithm and we don’t necessarily know what these are, for now we’re just looking for the underlying patterns within the data.
Our goal in using the k-means clustering algorithm is to answer the question of whether we can identify groups based on certain variables.
Now, it’s probably pretty easy to pick out three clusters in the above scatterplot by sight alone, however, this also happens to be quite a simple example of a dataset with only two variables.
What about when you have tens, or even hundreds, of different variables?
K-means will allow you to easily identify groups within your data, simplifying an otherwise complex task.
If you want to learn more, then read on my friend because we’re going to walk through how k-means clustering works step-by-step:
Step 1 • Choose the number of clusters (k)
The first step when running a k-means clustering algorithm is to choose the number of clusters that we want the algorithm to identify.
Sometimes this is pretty obvious, like in our earlier example where we can see three distinct groups. Most of the time, however, it isn’t nearly this straightforward.
Luckily, there’s an easy, data-backed way to determine the optimal number of clusters – we’ll talk more about this later 👇
For now, let’s tell the algorithm to find 3 clusters (k = 3).
Step 2 • Select the random points that will be the centroids of your clusters
Next, we need to determine where our centroids will be. We’ll need the same number of centroids as clusters. So in our example, we’d have 3 centroids.
As far as where these centroids are located, we can select any point on our scatterplot – they don’t have to be from our dataset.
Step 3 • Assign each data point to the closest centroid
Here’s where the k-means magic begins to happen.
For each data point in our dataset, we’ll need to identify which of the chosen centroids that it’s closest to.
The k-means algorithm calculates the distance between the data points and centroids quickly and efficiently, however, if we were to calculate the clusters ourselves, we could draw a perpendicular line equidistant between two centroids and then visually determine which centroid a data point was closest to.
Now, “closest” is a vague term. There are multiple ways of measuring distance, however for our purposes, we’re talking about Euclidean distance. This can be defined as the straight-line distance between two points.
Described another way, the Euclidean distance between points p and q is the length of the line segment connecting them.
So, for the k-means clusters algorithm, the distance between a data point and the different centroids is measured. Each data point is then assigned to the nearest centroid.
This creates a cluster.
Step 4 • Compute and place a new centroid for each cluster
Now that our algorithm has identified the clusters within our data, we’ll need to recalculate the centroids so they sit within the center of each cluster.
If we imagine that the centroids are weightless, while each data point has a mass (e.g. 1 lb or 1 kg) we want to find the center of gravity – or average – of all the data points.
Once we determine this, we can move each centroid so that it sits within the center of our identified clusters.
Step 5 • Re-assign each data point to the new closest centroid and if any movement took place, repeat step 4
Now that our centroids have moved, we’ll need to perform step 3 again and re-assign each data point to its nearest centroid. If any data points moved, we’ll need to repeat step 4 again calculating the center of gravity for each centroid.
Repeat steps 3 and 4 until your algorithm converges. This is the point at which there are no more data points that need to be re-assigned.
Remove the centroids and you now have your final results!
Finding the optimal number of clusters
Since we have to tell the algorithm how many clusters to look for, it’s important to know the optimal number of clusters for your specific dataset. To do this, you can use the within cluster sum of squares (WCSS) formula.
Let’s break WCSS down:
In the above formula we have three elements – one for each potential cluster. Remember, we’re searching for the optimal number of clusters so the number of elements will change.
We’re going to calculate the sum for each cluster and also within each cluster (hence the W in WCSS).
Let’s start by looking at cluster 1.
We can take a sum across the data points within the cluster, summing the distance between each point inside cluster 1 and the centroid of cluster 1. Then, we’ll square the distance. This means we’re taking the sum of those squared distances.
Once we’ve taken the sum of all the squares of all these distances, we’ll sum this up as well.
Do this for each cluster. The total sum is what we use to determine the optimal number of clusters.
If we were to do WCSS for just one cluster and measure the distance between every single data point and centroid, then square that distance and add them all up we’d get a really large number because the centroid would be far away from many points.
For each centroid that you add, your total distance decreases because there are more centroids that can therefore, be nearer to more data points.
If we were to assign a centroid to every single datapoint then the total distance would be 0 – there would be no distance between a centroid and data point. So the WCSS metric is good, but at the same time it’s constantly decreasing and producing a better “fit”.
So how do you tell the optimal number of centroids?
By plotting the number of clusters and the total WCSS distance.
When we do this, we can see the optimal number of clusters is when the total WCSS drops radically. This indicates that you won’t see significantly better results if you were to add more clusters.