\(k\)-means Clustering

\(k\)-means is a simple iterative partitioning method that divides the data into \(k\) clusters by minimizing the total within-cluster sum of squares \(\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2\), where \(\mu_j\) is the center of cluster \(C_j\). Because each observation is assigned to its nearest center, the clusters are separated by linear boundaries, so the method works best on linearly separable data.
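
To make the iterative procedure concrete, here is a minimal sketch of one step of Lloyd's algorithm, the classic \(k\)-means update. Note that lloyd_step is a hypothetical helper written for illustration, not part of base R, and empty clusters are not handled.

lloyd_step <- function(x, centers) {
  x <- as.matrix(x)
  # squared Euclidean distance from every observation to every center
  d2 <- sapply(seq_len(nrow(centers)),
               function(i) rowSums(sweep(x, 2, centers[i, ])^2))
  # assignment step: each observation joins its nearest center
  cluster <- apply(d2, 1, which.min)
  # update step: each center moves to the mean of its assigned observations
  new_centers <- t(sapply(seq_len(nrow(centers)),
                          function(i) colMeans(x[cluster == i, , drop = FALSE])))
  list(centers = new_centers, cluster = cluster)
}

The kmeans function used below iterates until the assignments stop changing; by default it uses the more refined Hartigan-Wong algorithm rather than plain Lloyd iteration.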

We will use it to cluster the Old Faithful geyser data set.

# Load the data
data("faithful")
head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

The data set contains two variables: eruptions (eruption duration in minutes) and waiting (waiting time to the next eruption in minutes). We will use these two variables to cluster the data.

Perform \(k\)-means clustering

cl2 <- kmeans(faithful, centers = 2)
table(cl2$cluster)
## 
##   1   2 
## 100 172

The object cl2$cluster contains the cluster assignments for each observation. The table function shows the number of observations in each cluster. Moreover, cl2$centers contains the cluster centers and cl2$withinss contains the within-cluster sums of squares.

print(cl2$centers) # Cluster centers
##   eruptions  waiting
## 1   2.09433 54.75000
## 2   4.29793 80.28488
print(cl2$withinss) # Within-cluster sums of squares
## [1] 3456.178 5445.591
print(cl2$tot.withinss) # Total sum of squared distances = sum(cl2$withinss)
## [1] 8901.769
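
Note that kmeans starts from randomly chosen centers, so repeated runs can swap the cluster labels or, on harder data sets, converge to different local optima. For reproducible results one can fix the random seed and use several random restarts; the seed value 1 below is arbitrary.

set.seed(1)                                  # fix the random initialization
kmeans(faithful, centers = 2, nstart = 10)   # keep the best of 10 random starts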

Visualize the clusters

plot(faithful, col = cl2$cluster)
points(cl2$centers, col = 1:2, pch = 8, cex = 5)

It is quite clear that there are 2 clusters, but how could we choose an appropriate \(k\) if we did not know it in advance?

Elbow method

Calculate \(k\)-means for a range of values of \(k\) and plot the total within-cluster sum of squares \(SS\) against \(k\). Adding clusters always decreases \(SS\), but beyond the true number of clusters the decrease slows down sharply, so the "elbow" where the curve flattens suggests a good choice of \(k\).

SS <- rep(0, 10)
for (k in 1:10) {
  # nstart = 10 keeps the best of 10 random starts, so a poor local
  # optimum does not distort the curve
  cl <- kmeans(faithful, centers = k, nstart = 10)
  SS[k] <- cl$tot.withinss
}
plot(1:10, SS, type = "b", xlab = "k", ylab = "SS")
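
One ad hoc way to read the elbow off numerically is to look at the fractional drop in \(SS\) from each \(k\) to \(k + 1\) and stop once the improvement falls below some threshold; the 10% cut-off below is an arbitrary assumption, not a standard rule.

rel_drop <- -diff(SS) / SS[-length(SS)]  # fractional decrease in SS from k to k + 1
which(rel_drop < 0.1)[1]                 # smallest k whose next split gains less than 10%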

Assessing the quality of the clusters

A useful check is to compare the distance between cluster centers with, for each cluster, the average distance of its observations to their own center. If the average within-cluster distance is much smaller than the distance between centers, the clustering is compact and well separated.

We can define a function to do this.

average_distance <- function(data, center) {
  n <- nrow(data)
  # add the center to the data as the last row
  # get the distance matrix and sum the last row
  total <- sum(as.matrix(dist(rbind(data, center)))[n + 1, ])
  return(total / n)
}

avg_dist_to_cluster_centers <- function(data, centers, clusters) {
  k <- nrow(centers)
  avg_dist <- rep(0, k)
  for (i in 1:k) {
    cluster_data <- data[which(clusters == i), ]
    avg_dist[i] <- average_distance(cluster_data, centers[i, ])
  }
  return(avg_dist)
}

We can now see how the clustering looks for \(k = 2\). First we calculate the distance between cluster centers.

dist(cl2$centers)
##          1
## 2 25.62979

And now the average distance to cluster centers.

avg_dist_to_cluster_centers(faithful, cl2$centers, cl2$cluster)
## [1] 4.899079 4.556494

Each cluster's average distance to its center (about 4.9 and 4.6) is much smaller than the distance between the centers (about 25.6), so the clusters are very compact.
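
As a rough cross-check, a similar per-cluster spread can be read directly from the kmeans object: cl2$withinss divided by the cluster sizes cl2$size gives the mean squared distance to the center, and its square root is the root-mean-square distance, which is by construction at least as large as the average distance computed above.

sqrt(cl2$withinss / cl2$size)  # RMS distance of each cluster's points to its center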

Let’s try switching to \(k = 3\) and see how the clustering looks.

cl3 <- kmeans(faithful, centers = 3)
plot(faithful, col = cl3$cluster)
points(cl3$centers, col = 1:3, pch = 8, cex = 5)

print(dist(cl3$centers))
##          1        2
## 2 17.25247         
## 3 29.94067 12.69574
avg_dist_to_cluster_centers(faithful, cl3$centers, cl3$cluster)
## [1] 4.080581 4.073555 3.023563
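
Comparing the two solutions, dividing the smallest distance between centers by the largest average within-cluster distance gives a crude separation score; this ratio is an informal summary used here only for illustration, not a standard statistic.

min(dist(cl2$centers)) /
  max(avg_dist_to_cluster_centers(faithful, cl2$centers, cl2$cluster))  # about 5.2
min(dist(cl3$centers)) /
  max(avg_dist_to_cluster_centers(faithful, cl3$centers, cl3$cluster))  # about 3.1

By this measure the two-cluster solution is the better separated one.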

Silhouette width method

The silhouette width measures how similar an observation is to its own cluster compared to the nearest other cluster. It ranges from -1 to 1: values close to 1 mean the clustering is well separated, values around 0 mean that the clusters probably overlap, and negative values mean that the observation has probably been assigned to the wrong cluster.
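
Concretely, for observation \(i\), let \(a(i)\) be its average distance to the other members of its own cluster and \(b(i)\) its average distance to the members of the nearest other cluster; the silhouette width is then

\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}. \]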

library("cluster")

dist_mat <- dist(faithful)^2
silhouette_cl2 <- silhouette(cl2$cluster, dist_mat)

summary(silhouette_cl2)
## Silhouette of 272 units in 2 clusters from silhouette.default(x = cl2$cluster, dist = dist_mat) :
##  Cluster sizes and average silhouette widths:
##       100       172 
## 0.8648855 0.8877335 
## Individual silhouette widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1238  0.8915  0.9336  0.8793  0.9489  0.9556
plot(silhouette_cl2, col = 1:2, border = NA)

Here we constructed a distance matrix dist_mat recording the squared Euclidean distance between observations, which is the distance metric used in \(k\)-means clustering. This matrix was then combined with the cluster labels cl2$cluster to compute the silhouette width of every observation. The average silhouette widths are high for both clusters (0.86 for cluster 1 and 0.89 for cluster 2), and the silhouette plot confirms that the majority of observations have high silhouette widths.
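
The average silhouette width can also be used to choose \(k\), analogously to the elbow method: run \(k\)-means for a range of values of \(k\) and pick the value with the largest average width. A sketch, reusing the squared-distance matrix from above:

avg_sil <- rep(0, 9)
for (k in 2:10) {  # the silhouette is only defined for at least 2 clusters
  cl <- kmeans(faithful, centers = k, nstart = 10)
  avg_sil[k - 1] <- mean(silhouette(cl$cluster, dist_mat)[, "sil_width"])
}
plot(2:10, avg_sil, type = "b", xlab = "k", ylab = "Average silhouette width")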

Comparison with hierarchical clustering

hcl <- cutree(hclust(dist(faithful)), 2)
icl <- kmeans(faithful, centers = 2)
table(hcl, icl$cluster)
##    
## hcl   1   2
##   1  17 172
##   2  83   0
par(mfrow = c(1, 2))
plot(faithful, col = hcl, main = "Hierarchical clustering")
plot(faithful, col = icl$cluster, main = "K-means clustering")
points(icl$centers, col = 1:2, pch = 8, cex = 5)  # centers of the k-means fit shown here

The adjusted Rand index of these two clusterings, which measures agreement between two partitions corrected for chance (1 means the partitions are identical), is:

library(flexclust)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
randIndex(hcl, icl$cluster)
##       ARI 
## 0.7624052
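
For comparison, randIndex can also return the unadjusted Rand index, the raw proportion of pairs of observations on which the two partitions agree; it does not correct for chance agreement (output not shown).

randIndex(hcl, icl$cluster, correct = FALSE)  # raw Rand index, no chance correction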