K-Means Clustering in R: Step-by-Step Example (2024)

Clustering is a technique in machine learning that attempts to find clusters of observations within a dataset.

The goal is to find clusters such that the observations within each cluster are quite similar to each other, while observations in different clusters are quite different from each other.

Clustering is a form of unsupervised learning because we’re simply attempting to find structure within a dataset rather than predicting the value of some response variable.

Clustering is often used in marketing when companies have access to information like:

Household income
Household size
Head of household occupation
Distance from nearest urban area

When this information is available, clustering can be used to identify households that are similar and may be more likely to purchase certain products or respond better to a certain type of advertising.

One of the most common forms of clustering is known as k-means clustering.

What is K-Means Clustering?

K-means clustering is a technique in which we place each observation in a dataset into one of K clusters.

The end goal is to have K clusters in which the observations within each cluster are quite similar to each other while the observations in different clusters are quite different from each other.

In practice, we use the following steps to perform K-means clustering:

1. Choose a value for K.
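After K is chosen, the standard formulation of the algorithm (often called Lloyd's algorithm) assigns each observation to a random initial cluster, then alternates between computing each cluster's centroid and reassigning every observation to its nearest centroid until the assignments stop changing. The following is a minimal sketch of that loop in base R. The helper name simple_kmeans() and its arguments are hypothetical and chosen for illustration only; R's built-in kmeans() uses the Hartigan-Wong algorithm by default, so this is not how kmeans() itself is implemented.

```r
# Illustrative sketch of Lloyd's algorithm; simple_kmeans is a hypothetical
# helper, not a function from base R or any package.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Randomly assign each observation to one of the k clusters.
  # rep_len() guarantees that every cluster starts with at least one member
  # (assuming nrow(x) >= k).
  cluster <- sample(rep_len(seq_len(k), nrow(x)))
  centers <- matrix(0, nrow = k, ncol = ncol(x))

  for (iter in seq_len(max_iter)) {
    # Compute the centroid (vector of column means) of each cluster.
    # A cluster that has become empty keeps its previous centroid.
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) {
        centers[j, ] <- colMeans(x[members, , drop = FALSE])
      }
    }
    # Squared Euclidean distance from every observation to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Reassign each observation to its nearest centroid
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop once the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers)
}

set.seed(42)
res <- simple_kmeans(scale(USArrests), k = 4)
table(res$cluster)
```

Because the starting assignment is random, different seeds can give different final clusters, which is the reason real implementations are usually run from several starting points.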

K-Means Clustering in R

The following tutorial provides a step-by-step example of how to perform k-means clustering in R.

Step 1: Load the Necessary Packages

First, we’ll load two packages that contain several useful functions for k-means clustering in R.

library(factoextra)
library(cluster)

Step 2: Load and Prep the Data

For this example we’ll use the USArrests dataset built into R, which contains the number of arrests per 100,000 residents in each U.S. state in 1973 for Murder, Assault, and Rape, along with UrbanPop, the percentage of the population in each state living in urban areas.

The following code shows how to:

Load the USArrests dataset
Remove any rows with missing values
Scale each variable in the dataset to have a mean of 0 and a standard deviation of 1

#load data
df <- USArrests

#remove rows with missing values
df <- na.omit(df)

#scale each variable to have a mean of 0 and sd of 1
df <- scale(df)

#view first six rows of dataset
head(df)

               Murder   Assault   UrbanPop         Rape
Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
Arizona    0.07163341 1.4788032  0.9989801  1.042878388
Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144  1.7589234  2.067820292
Colorado   0.02571456 0.3988593  0.8608085  1.864967207
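Scaling matters because k-means relies on distances between observations, so a variable measured on a larger numeric scale would otherwise dominate the result. As a rough check of what scale() does, the sketch below standardizes the columns by hand and compares the result with scale(). The variable names raw, manual, and scaled are illustrative only.

```r
raw <- na.omit(USArrests)

# Standardize by hand: subtract each column's mean, then divide by its
# sample standard deviation
centered <- sweep(as.matrix(raw), 2, colMeans(raw), "-")
manual <- sweep(centered, 2, apply(raw, 2, sd), "/")

# scale() performs the same computation with its default arguments
scaled <- scale(raw)

# TRUE if the two results agree (ignoring the attributes scale() attaches)
all.equal(manual, scaled, check.attributes = FALSE)

# Each column now has mean 0 (up to rounding) and standard deviation 1
round(colMeans(scaled), 10)
apply(scaled, 2, sd)
```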

Step 3: Find the Optimal Number of Clusters

To perform k-means clustering in R we can use the built-in kmeans() function, which uses the following syntax:

kmeans(data, centers, nstart)

where:

data: The dataset to cluster, as a numeric matrix or a data frame with all numeric columns (in R’s own function signature this argument is named x).
centers: The number of clusters, denoted k.
nstart: The number of random initial configurations to try. Because different initial starting clusters can lead to different results, it’s recommended to use several. kmeans() runs the algorithm once from each configuration and returns the solution with the smallest total within-cluster sum of squares.
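To illustrate the effect of nstart, the following sketch fits the same model with one starting configuration and with 25, then compares the total within-cluster sum of squares stored in the tot.withinss component. The object names km_single and km_multi are illustrative. On a small, well-behaved dataset the two values may well coincide; the point is only that the multi-start fit should not be worse.

```r
df <- scale(na.omit(USArrests))

# One random starting configuration
set.seed(1)
km_single <- kmeans(df, centers = 4, nstart = 1)

# Best of 25 random starting configurations
set.seed(1)
km_multi <- kmeans(df, centers = 4, nstart = 25)

# Total within-cluster sum of squares (smaller means tighter clusters)
km_single$tot.withinss
km_multi$tot.withinss
```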

Since we don’t know beforehand how many clusters is optimal, we’ll create two different plots that can help us decide:
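Two plots commonly used for this purpose are an "elbow" plot of the total within-cluster sum of squares against k, and a plot of the gap statistic against k. Treating those as the two plots is an assumption here. The sketch below draws both with base R graphics and clusGap() from the cluster package; the range of 1 to 10 clusters and B = 50 bootstrap samples are arbitrary illustrative choices. The fviz_nbclust() function in factoextra can draw similar plots with method = "wss" or method = "gap_stat".

```r
library(cluster)

df <- scale(na.omit(USArrests))
set.seed(1)

# Plot 1: total within-cluster sum of squares for k = 1, ..., 10.
# Look for the "elbow" where adding clusters stops helping much.
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(df, centers = k, nstart = 25)$tot.withinss
})
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

# Plot 2: gap statistic for k = 1, ..., 10.
# Larger values suggest a better-supported number of clusters.
gap_stat <- clusGap(df, FUNcluster = kmeans, nstart = 25, K.max = 10, B = 50)
plot(gap_stat, main = "Gap statistic", xlab = "Number of clusters k")
```

Both plots are heuristics, so the choice of k usually also depends on how interpretable the resulting clusters are.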

Step 4: Perform K-Means Clustering with Optimal K

Lastly, we can perform k-means clustering on the dataset using the optimal value for k of 4:

#make this example reproducible
set.seed(1)

#perform k-means clustering with k = 4 clusters
km <- kmeans(df, centers = 4, nstart = 25)

#view results
km

K-means clustering with 4 clusters of sizes 16, 13, 13, 8

Cluster means:
      Murder    Assault   UrbanPop        Rape
1 -0.4894375 -0.3826001  0.5758298 -0.26165379
2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
3  0.6950701  1.0394414  0.7226370  1.27693964
4  1.4118898  0.8743346 -0.8145211  0.01927104

Clustering vector:
       Alabama         Alaska        Arizona       Arkansas     California       Colorado
             4              3              3              4              3              3
   Connecticut       Delaware        Florida        Georgia         Hawaii          Idaho
             1              1              3              4              1              2
      Illinois        Indiana           Iowa         Kansas       Kentucky      Louisiana
             3              1              2              1              2              4
         Maine       Maryland  Massachusetts       Michigan      Minnesota    Mississippi
             2              3              1              3              2              4
      Missouri        Montana       Nebraska         Nevada  New Hampshire     New Jersey
             3              2              2              3              2              1
    New Mexico       New York North Carolina   North Dakota           Ohio       Oklahoma
             3              3              4              2              1              1
        Oregon   Pennsylvania   Rhode Island South Carolina   South Dakota      Tennessee
             1              1              1              4              2              4
         Texas           Utah        Vermont       Virginia     Washington  West Virginia
             3              1              2              1              1              2
     Wisconsin        Wyoming
             2              1

Within cluster sum of squares by cluster:
[1] 16.212213 11.952463 19.922437  8.316061
 (between_SS / total_SS =  71.2 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"
[7] "size"         "iter"         "ifault"

From the results we can see that:

16 states were assigned to the first cluster
13 states were assigned to the second cluster
13 states were assigned to the third cluster
8 states were assigned to the fourth cluster
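These counts, and the other pieces of the printed summary, can also be read directly from the components of the fitted object. The sketch below refits the model so that it runs on its own and pulls out a few of the components listed in the output.

```r
df <- scale(na.omit(USArrests))
set.seed(1)
km <- kmeans(df, centers = 4, nstart = 25)

# Number of states in each cluster
km$size

# The same counts, tallied from the vector of cluster assignments
table(km$cluster)

# Share of the total variation explained by the clustering
# (the between_SS / total_SS figure in the printed summary)
km$betweenss / km$totss

# State names grouped by cluster
split(names(km$cluster), km$cluster)
```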

We can visualize the clusters on a scatterplot that displays the first two principal components on the axes using the fviz_cluster() function:

#plot results of final k-means model
fviz_cluster(km, data = df)
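For readers who want to see roughly what such a plot is built from, the sketch below computes the first two principal components with prcomp() from base R and colors the points by cluster. It is an approximation of the idea rather than a reproduction of fviz_cluster(), which adds its own styling; the object names pca, scores, and var_explained are illustrative.

```r
df <- scale(na.omit(USArrests))
set.seed(1)
km <- kmeans(df, centers = 4, nstart = 25)

# Principal component analysis (df is already standardized)
pca <- prcomp(df)
scores <- pca$x[, 1:2]

# Proportion of total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Scatterplot of the first two components, colored by cluster
plot(scores, col = km$cluster, pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", 100 * var_explained[1]),
     ylab = sprintf("PC2 (%.1f%%)", 100 * var_explained[2]))
text(scores, labels = rownames(df), pos = 3, cex = 0.6)
```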

We can also use the aggregate() function to find the mean of the variables in each cluster:

#find means of each cluster
aggregate(USArrests, by=list(cluster=km$cluster), mean)

cluster   Murder   Assault UrbanPop     Rape
      1  3.60000  78.53846 52.07692 12.17692
      2 10.81538 257.38462 76.00000 33.19231
      3  5.65625 138.87500 73.87500 18.78125
      4 13.93750 243.62500 53.75000 21.41250

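Note that k-means cluster labels are arbitrary, and the numbering in this output does not match the printed model above: cluster 1 here, which has the lowest mean on every variable, is the 13-state cluster labeled 2 there.

We interpret this output as follows: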

The mean number of murders per 100,000 citizens among the states in cluster 1 is 3.6.
The mean number of assaults per 100,000 citizens among the states in cluster 1 is 78.5.
The mean percentage of residents living in an urban area among the states in cluster 1 is 52.1%.
The mean number of rapes per 100,000 citizens among the states in cluster 1 is 12.2.

And so on.
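Because standardization is a linear transformation, the same cluster means can also be recovered from the scaled centers stored in the model by multiplying by each variable's standard deviation and adding back its mean. The sketch below checks this against aggregate(); the object names col_means, col_sds, centers_original, and agg are illustrative.

```r
df <- scale(na.omit(USArrests))
set.seed(1)
km <- kmeans(df, centers = 4, nstart = 25)

# scale() stores the column means and standard deviations it used
col_means <- attr(df, "scaled:center")
col_sds <- attr(df, "scaled:scale")

# Undo the standardization: original = scaled * sd + mean
centers_original <- sweep(sweep(km$centers, 2, col_sds, "*"), 2, col_means, "+")
centers_original

# Compare with the per-cluster means computed on the original data
agg <- aggregate(USArrests, by = list(cluster = km$cluster), mean)
all.equal(unname(centers_original), unname(as.matrix(agg[, -1])))
```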

We can also append the cluster assignments of each state back to the original dataset (the labels shown follow the same numbering as the aggregate() output above):

#add cluster assignment to original data
final_data <- cbind(USArrests, cluster = km$cluster)

#view final data
head(final_data)

           Murder Assault UrbanPop Rape cluster
Alabama      13.2     236       58 21.2       4
Alaska       10.0     263       48 44.5       2
Arizona       8.1     294       80 31.0       2
Arkansas      8.8     190       50 19.5       4
California    9.0     276       91 40.6       2
Colorado      7.9     204       78 38.7       2

Pros & Cons of K-Means Clustering

K-means clustering offers the following benefits:

It is a fast algorithm.
It can handle large datasets well.

However, it comes with the following potential drawbacks:

It requires us to specify the number of clusters before performing the algorithm.
It’s sensitive to outliers.

Two alternatives to k-means clustering are k-medoids clustering and hierarchical clustering.
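As a brief sketch of how these alternatives might be tried on the same data, the code below uses pam() from the cluster package for k-medoids and hclust() with cutree() from base R for hierarchical clustering. The choice of four clusters and of Ward linkage ("ward.D2") are assumptions made for illustration, not recommendations.

```r
library(cluster)

df <- scale(na.omit(USArrests))

# k-medoids: each cluster is represented by an actual observation (a medoid),
# which makes the method less sensitive to outliers than cluster means
pam_fit <- pam(df, k = 4)
pam_fit$medoids
table(pam_fit$clustering)

# Hierarchical clustering: build a tree from pairwise Euclidean distances,
# then cut it into four groups
hc <- hclust(dist(df), method = "ward.D2")
hc_clusters <- cutree(hc, k = 4)
table(hc_clusters)
```

With hierarchical clustering the number of clusters does not have to be fixed in advance, since the same tree can be cut at different heights.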
