\(k\)-means is a simple iterative partitioning method that divides the data into \(k\) clusters. Because each point is assigned to its nearest center, the cluster boundaries are linear, so the method works best when the clusters are roughly convex and linearly separable.
We will use it to cluster the Old Faithful geyser data set.
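Under the hood, kmeans() repeats two steps until the assignments stop changing: assign every observation to its nearest center, then move each center to the mean of its assigned observations. Below is a minimal sketch of one such Lloyd iteration, for illustration only (lloyd_step is our own name, not a base R function, and the sketch ignores edge cases such as empty clusters).
lloyd_step <- function(X, centers) {
  k <- nrow(centers)
  # distances from every observation (a row of the matrix X) to every center
  d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
  # step 1: assign each observation to its nearest center
  cluster <- max.col(-d)
  # step 2: move each center to the mean of its assigned observations
  centers <- t(sapply(1:k, function(i) colMeans(X[cluster == i, , drop = FALSE])))
  list(cluster = cluster, centers = centers)
}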
# Load the data
data("faithful")
head(faithful)
## eruptions waiting
## 1 3.600 79
## 2 1.800 54
## 3 3.333 74
## 4 2.283 62
## 5 4.533 85
## 6 2.883 55
The data set contains two variables: eruptions and waiting. We will use these two variables to cluster the data.
cl2 <- kmeans(faithful, centers = 2)
table(cl2$cluster)
##
## 1 2
## 100 172
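Note that kmeans() picks its initial centers at random, so the numeric cluster labels are arbitrary and can swap between runs. If reproducibility matters, fix the random seed before clustering (the seed value below is an arbitrary choice):
set.seed(1) # any fixed value gives reproducible results
cl2 <- kmeans(faithful, centers = 2)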
The object cl2$cluster contains the cluster assignment for each observation, and the table function shows the number of observations in each cluster. Moreover, cl2$centers contains the cluster centers and cl2$withinss contains the within-cluster sums of squares.
print(cl2$centers) # Cluster centers
## eruptions waiting
## 1 2.09433 54.75000
## 2 4.29793 80.28488
print(cl2$withinss) # Within-cluster sums of squares
## [1] 3456.178 5445.591
print(cl2$tot.withinss) # Total sum of squared distances = sum(cl2$withinss)
## [1] 8901.769
plot(faithful, col = cl2$cluster)
points(cl2$centers, col = 1:2, pch = 8, cex = 5)
It is quite clear that there are 2 clusters, but how could we choose an appropriate \(k\) if we did not know it in advance?
A common approach, the so-called elbow method, is to calculate \(k\)-means for a range of \(k\) values and plot the total within-cluster sum of squares \(SS\) against \(k\).
SS <- rep(0, 10)
for (k in 1:10) {
  # nstart = 10 restarts k-means from several random initialisations
  # and keeps the best fit, which makes the elbow curve more stable
  cl <- kmeans(faithful, centers = k, nstart = 10)
  SS[k] <- cl$tot.withinss
}
plot(1:10, SS, type = "b", xlab = "k", ylab = "SS")
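One informal way to read the elbow off these numbers is to look at the relative drop in \(SS\) when going from \(k\) to \(k + 1\); the gains shrink sharply once \(k\) passes the elbow. This is a rough heuristic, not a formal rule:
# proportion of the remaining SS removed by each additional cluster
round(-diff(SS) / SS[-length(SS)], 2)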
Another useful diagnostic is to compare the distances between cluster centers with, for each cluster, the average distance of its points to their own center. If the average within-cluster distance is clearly smaller than the distance between centers, the clusters are compact and well separated.
We can define a function to do this.
average_distance <- function(data, center) {
  n <- nrow(data)
  # append the center to the data as the last row; the last row of the
  # resulting distance matrix then holds the distance from the center
  # to every observation
  total <- sum(as.matrix(dist(rbind(data, center)))[n + 1, ])
  return(total / n)
}
avg_dist_to_cluster_centers <- function(data, centers, clusters) {
  k <- nrow(centers)
  avg_dist <- rep(0, k)
  for (i in 1:k) {
    # average distance of the observations in cluster i to its center
    cluster_data <- data[clusters == i, ]
    avg_dist[i] <- average_distance(cluster_data, centers[i, ])
  }
  return(avg_dist)
}
We can now see how the clustering looks for \(k = 2\). First we calculate the distance between cluster centers.
dist(cl2$centers)
## 1
## 2 25.62979
And now the average distance to cluster centers.
avg_dist_to_cluster_centers(faithful, cl2$centers, cl2$cluster)
## [1] 4.899079 4.556494
We can see that the clusters are very compact: the average distances to the cluster centers (about 4.9 and 4.6) are much smaller than the distance between the centers (about 25.6).
Let’s try switching to \(k = 3\) and see how the clustering looks.
cl3 <- kmeans(faithful, centers = 3)
plot(faithful, col = cl3$cluster)
points(cl3$centers, col = 1:3, pch = 8, cex = 5)
print(dist(cl3$centers))
## 1 2
## 2 17.25247
## 3 29.94067 12.69574
avg_dist_to_cluster_centers(faithful, cl3$centers, cl3$cluster)
## [1] 4.080581 4.073555 3.023563
With \(k = 3\), the two closest centers are only about 12.7 apart while the average within-cluster distances remain around 3 to 4, so the separation is much less convincing than for \(k = 2\). A more formal diagnostic is the silhouette width, a measure of how similar an object is to its own cluster compared to the second nearest cluster. The silhouette width ranges from -1 to 1: a high value means that the clustering is well separated, values around 0 mean that the clusters probably overlap, and negative values mean that the data point has probably been misclassified.
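Formally, if \(a(i)\) is the mean distance from observation \(i\) to the other members of its own cluster and \(b(i)\) is the mean distance to the members of the nearest other cluster, then
\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}. \]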
library("cluster")
dist_mat <- dist(faithful)^2
silhouette_cl2 <- silhouette(cl2$cluster, dist_mat)
summary(silhouette_cl2)
## Silhouette of 272 units in 2 clusters from silhouette.default(x = cl2$cluster, dist = dist_mat) :
## Cluster sizes and average silhouette widths:
## 100 172
## 0.8648855 0.8877335
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1238 0.8915 0.9336 0.8793 0.9489 0.9556
plot(silhouette_cl2, col = 1:2, border = NA)
Here we constructed a distance matrix dist_mat that records the squared Euclidean distance between observations, which is the distance metric used in \(k\)-means clustering. This distance matrix was then used along with the cluster labels cl2$cluster to compute the silhouette width for all observations. Overall we see high average silhouette widths for both clusters (0.86 and 0.89 for clusters 1 and 2, respectively). Plotting the silhouette widths for all observations also shows that a majority of observations score highly.
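The average silhouette width can also guide the choice of \(k\), complementing the elbow plot above: run \(k\)-means for several values of \(k\) and prefer the one with the highest average width. A small sketch, reusing dist_mat from above:
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(faithful, centers = k, nstart = 10)
  # mean silhouette width over all observations
  mean(silhouette(cl$cluster, dist_mat)[, "sil_width"])
})
avg_sil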
Finally, let us compare \(k\)-means with hierarchical clustering, cutting the dendrogram into two groups.
hcl <- cutree(hclust(dist(faithful)), 2)
icl <- kmeans(faithful, centers = 2)
table(hcl, icl$cluster)
##
## hcl 1 2
## 1 17 172
## 2 83 0
par(mfrow = c(1, 2))
plot(faithful, col = hcl, main = "Hierarchical clustering")
plot(faithful, col = icl$cluster, main = "K-means clustering")
points(icl$centers, col = 1:2, pch = 8, cex = 5)
The adjusted Rand index of these two clusterings is:
library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
randIndex(hcl, icl$cluster)
## ARI
## 0.7624052
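For intuition, the unadjusted Rand index is simply the fraction of all pairs of observations on which the two clusterings agree, i.e. the pair is placed together in both or apart in both; randIndex() above reports the chance-corrected, adjusted version. A minimal sketch (rand_index is our own helper, not a flexclust function):
rand_index <- function(a, b) {
  n <- length(a)
  # TRUE where two observations share a cluster under each labelling
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  # a pair agrees if it is together in both clusterings or apart in both
  agree <- same_a == same_b
  sum(agree[upper.tri(agree)]) / choose(n, 2)
}
rand_index(hcl, icl$cluster)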