The Problem
You are an analyst at a software company, assigned to explore the market for the new fitness app your company is developing. From a dataset of heights and weights of the population, you have to find distinct groups of people. To make things easier, your manager asked you to find 3 groups of people with similar (closest) height-weight pairs.
To solve this kind of exploratory analysis problem, the simplest yet elegant tool is the K-Means Clustering Algorithm. This algorithm has a wide variety of use cases in areas like customer segmentation, social network analysis, fraud detection, text clustering (NLP), traffic analysis, quality control, and so on.
How it works – Step by Step
Before defining the algorithm formally, let’s try to understand how this algorithm works.
Let’s say we have a dataset of the heights and weights of a population. They are plotted on the X and Y axes as shown below. The dotted boundaries show the actual clusters of this dataset (which we do not know yet!). We will find them by applying the K-Means Clustering Algorithm.
We see there are 3 clusters, which means K = 3 (yes, in K-Means, K is the number of clusters we want to explore!)
In the first step, the algorithm places K random points among the population. In this case, we place 3 random points c0, c1, and c2 as shown below. These points are usually called centroids.
Next, we select a point from the dataset. In the image below, we are selecting d0.
Now calculate the distance from d0 to each centroid and find the minimum distance. It is clear that centroid c0 is the closest to d0.
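This nearest-centroid check can be sketched with NumPy. The coordinates below are made up purely for illustration:

```python
import numpy as np

# Hypothetical coordinates: three centroids and one data point d0.
centroids = np.array([[160.0, 55.0],   # c0
                      [175.0, 80.0],   # c1
                      [190.0, 95.0]])  # c2
d0 = np.array([158.0, 52.0])

# Euclidean distance from d0 to every centroid.
distances = np.linalg.norm(centroids - d0, axis=1)

# Index of the closest centroid (here c0, index 0).
nearest = int(np.argmin(distances))
```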
Now is the most fun part!
- Repeat the distance calculation for each point of the dataset.
- Assign each point to its closest centroid. In other words,
- centroid c0 => is assigned with points closest to it.
- centroid c1 => is assigned with points closest to it.
- centroid c2 => is assigned with points closest to it.
- Calculate the average X and Y for each point group. Let’s say c0 has points d0 (X0, Y0), d1(X1, Y1) and d2(X2, Y2). The average X = (X0 + X1 + X2) / 3 and average Y = (Y0 + Y1 + Y2) / 3.
- Update the coordinates of each centroid with their respective average X and Y. After updates, each centroid might change its location like the image below.
After doing the centroid update process multiple times, we will find the population nicely clustered as below.
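The steps above can be put together into a minimal from-scratch sketch. The dataset here is synthetic, generated just for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic height-weight data: three made-up groups, for illustration only.
data = np.vstack([
    rng.normal([155, 50], 2.0, size=(30, 2)),
    rng.normal([170, 70], 2.0, size=(30, 2)),
    rng.normal([185, 95], 2.0, size=(30, 2)),
])

def kmeans(points, k, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the average X and Y of its points
        # (keep the old centroid if a cluster happens to be empty).
        centroids = np.array([points[labels == j].mean(axis=0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
    return centroids, labels

centroids, labels = kmeans(data, k=3)
```

A fixed number of iterations keeps the sketch short; a convergence check (stop when centroids no longer move) is the more common stopping rule.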
Randomness, Iterations, and Cluster Quality
We explained the algorithm with a nicely separated dataset. But in the real world, this is mostly untrue. Real-world datasets can be messy. Also, our initial centroids are generated randomly. What if they all start out close to each other? If so, they will erroneously cluster our population as shown below.
To avoid such errors:
- Run the algorithm multiple times. Each run starts from freshly generated random centroids.
- For each run, calculate the total distance between each centroid and its cluster's data points. The run with the smallest total distance gives the tightest, most reliable clustering.
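This restart strategy can be sketched as below; the total point-to-centroid distance is commonly called the inertia. The dataset is synthetic, made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic two-group dataset, made up for illustration.
data = np.vstack([rng.normal([160, 55], 2.0, (40, 2)),
                  rng.normal([185, 90], 2.0, (40, 2))])

def run_once(points, k, seed, n_iters=15):
    """One K-Means run from a fresh random initialization."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iters):
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([points[labels == j].mean(axis=0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
    # Inertia: total distance from each point to its assigned centroid.
    dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
    inertia = dists.min(axis=1).sum()
    return centroids, inertia

# Several restarts; keep the run whose clusters are tightest (smallest inertia).
best_centroids, best_inertia = min(
    (run_once(data, k=2, seed=s) for s in range(5)), key=lambda r: r[1])
```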
Formal Definition – What is K-Means Clustering?
Hopefully, we now understand how the algorithm works, so at this point we are ready for a formal definition!
So, K-means clustering is a method used in data analysis and machine learning to group similar data points into clusters.
It aims to partition a set of data points into K clusters, where each data point belongs to the cluster with the nearest mean (centroid). The “K” in K-means refers to the number of clusters predefined by the user. The algorithm iteratively assigns data points to the nearest cluster based on their features and recalculates the centroids until the clusters’ positions stabilize, typically minimizing the variance within each cluster.
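In code, "until the clusters' positions stabilize" translates into a convergence check: stop once no centroid moves more than a small tolerance. A minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic dataset with two obvious groups, for illustration only.
points = np.vstack([rng.normal([150, 45], 2.0, (25, 2)),
                    rng.normal([180, 85], 2.0, (25, 2))])

k = 2
centroids = points[rng.choice(len(points), k, replace=False)]
for step in range(100):  # hard cap on iterations
    dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = np.array([points[labels == j].mean(axis=0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
    moved = np.linalg.norm(new_centroids - centroids, axis=1).max()
    centroids = new_centroids
    if moved < 1e-9:  # positions have stabilized
        break
```

Once the assignments stop changing, the recomputed means are identical to the previous centroids, so the loop is guaranteed to terminate.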
K-Means In Action – Image Segmentation
A cool application of K-Means clustering is image segmentation. Let’s say, we have a grayscale image as below. A grayscale image has pixel values between 0 and 255. If we apply K-Means with K = 2, all the pixels will fall under two groups.
We can then assign each group of pixels a certain value to generate the segmented version of the image.
For example, for 2 clusters, we might assign 0 for one cluster and 255 for the other. This way we will get a segmented image like below.
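This two-cluster pixel segmentation can be sketched on a small synthetic "image"; a real example would load actual image pixels instead:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2D grayscale "image": a dark region (~20-60) above a bright one (~180-220).
image = np.vstack([rng.integers(20, 60, (16, 32)),
                   rng.integers(180, 220, (16, 32))]).astype(float)

pixels = image.reshape(-1, 1)           # each pixel is a 1-D sample
k = 2
centroids = np.array([[0.0], [255.0]])  # start at the extremes of the gray range
for _ in range(10):
    labels = np.abs(pixels - centroids.T).argmin(axis=1)
    centroids = np.array([[pixels[labels == j].mean()] for j in range(k)])

# Replace each pixel with 0 or 255 depending on its cluster.
dark = centroids.argmin()
segmented = np.where(labels.reshape(image.shape) == dark, 0, 255)
```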
Here are some examples of different levels of segmenting with different K values. Interested learners can tweak this code used for image segmentation.
Want to Learn More?
This tutorial did not cover everything. For example, how do we choose the optimal value of K? You can explore the missing pieces in the references below.
Happy Clustering!