Hierarchical clustering is an unsupervised machine learning algorithm used to cluster unlabeled data points. Like K-means clustering, it groups together data points with similar characteristics, but instead of a single partition it produces a whole hierarchy of nested clusters.
# Load the data
data <- read.csv("datasets/olive.csv")
dim(data)
## [1] 572 10
head(data)
## Region Area palmitic palmitoleic stearic oleic linoleic linolenic
## 1 South North Apulia 1075 75 226 7823 672 36
## 2 South North Apulia 1088 73 224 7709 781 31
## 3 South North Apulia 911 54 246 8113 549 31
## 4 South North Apulia 966 57 240 7952 619 50
## 5 South North Apulia 1051 67 259 7771 672 50
## 6 South North Apulia 911 49 268 7924 678 51
## arachidic eicosenoic
## 1 60 29
## 2 61 29
## 3 63 29
## 4 78 35
## 5 80 46
## 6 70 44
The olive oil data were collected from 9 different areas across Southern Italy, Northern Italy and Sicily. The first two columns give the region and area of origin; the rest are the acids present in the oil. We care only about the acids.
acids <- data[, -c(1, 2)]
head(acids)
## palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
## 1 1075 75 226 7823 672 36 60 29
## 2 1088 73 224 7709 781 31 61 29
## 3 911 54 246 8113 549 31 63 29
## 4 966 57 240 7952 619 50 78 35
## 5 1051 67 259 7771 672 50 80 46
## 6 911 49 268 7924 678 51 70 44
acids <- scale(acids, center = TRUE, scale = TRUE)
acids_dis <- dist(acids, method = "euclidean")
acids_dis_mat <- as.matrix(acids_dis)
Now we can check for distances between observations.
Euclidean distance results:
acids_dis_mat[1:5, 1:5]
## 1 2 3 4 5
## 1 0.0000000 0.6644688 1.528744 1.686387 2.076803
## 2 0.6644688 0.0000000 1.871914 2.129154 2.353875
## 3 1.5287437 1.8719138 0.000000 1.780000 2.458035
## 4 1.6863871 2.1291537 1.780000 0.000000 1.192788
## 5 2.0768026 2.3538752 2.458035 1.192788 0.000000
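As a quick sanity check on what dist() computes, here is a small self-contained toy example (separate from the olive data) confirming that the Euclidean entry is just the usual square-root-of-squared-differences formula:

```r
# Two points: (0, 0) and (3, 4). Their Euclidean distance is
# sqrt(3^2 + 4^2) = 5, and dist() should agree.
m <- matrix(c(0, 0, 3, 4), nrow = 2, byrow = TRUE)
d <- as.matrix(dist(m, method = "euclidean"))
d[1, 2] # 5
```

The same check works on the scaled olive data, e.g. comparing sqrt(sum((acids[1, ] - acids[2, ])^2)) against acids_dis_mat[1, 2].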
Manhattan distance results:
manh_dis <- dist(acids, method = "manhattan")
manh_mat <- as.matrix(manh_dis)
manh_mat[1:5, 1:5]
## 1 2 3 4 5
## 1 0.000000 1.330424 3.660028 4.229217 4.515443
## 2 1.330424 0.000000 4.052381 5.392657 5.422607
## 3 3.660028 4.052381 0.000000 3.803694 6.225028
## 4 4.229217 5.392657 3.803694 0.000000 2.747910
## 5 4.515443 5.422607 6.225028 2.747910 0.000000
clust1 <- hclust(acids_dis, method = "average")
plot(clust1, labels = FALSE, main = "Euclidean distance - Average linkage")
clust2 <- hclust(manh_dis, method = "single")
plot(clust2, labels = FALSE, main = "Manhattan distance - Single linkage")
The object clust1 contains the hierarchical clustering information. For example:
- $merge describes the order in which clusters were merged
- $height contains the height at which the clusters were merged
- $order contains the order of the observations in the dendrogram
head(clust1$merge)
## [,1] [,2]
## [1,] -548 -553
## [2,] -450 -451
## [3,] -551 1
## [4,] -439 -441
## [5,] -549 -550
## [6,] -436 -440
head(clust1$height)
## [1] 0.06422800 0.07100611 0.09037745 0.09198610 0.11046836 0.14337576
head(clust1$order)
## [1] 522 527 551 548 553 537
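In $merge, a negative entry refers to a single observation and a positive entry refers to the cluster formed at that earlier step; so row 3 above says observation 551 was joined to the cluster created at step 1 (observations 548 and 553). A small self-contained example makes the convention visible:

```r
# Four points on a line: two tight pairs far apart.
hc <- hclust(dist(c(0, 1, 10, 11)), method = "average")
hc$merge
# The first two rows merge singletons (all entries negative);
# the last row joins the clusters formed at steps 1 and 2
# (both entries positive).
```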
We can cut the tree at a certain height to get a certain number of clusters. A “recommended” height is \(\bar h+3s_h\) where \(\bar h\) is the mean height at which groups are joined, and \(s_h\) is the standard deviation of these heights.
h <- mean(clust1$height) + 3 * sd(clust1$height)
plot(clust1, labels = FALSE)
abline(h = h, lty = 2, col = 2) # adds line to an already existing plot
labels_h <- cutree(clust1, h = h)
table(labels_h)
## labels_h
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 30 74 197 1 3 5 13 98 82 1 54 1 13
We can also cut the tree at a certain number of clusters. We know that there are 9 areas, so we can cut the tree at 9 clusters.
labels_k <- cutree(clust1, k = 9)
table(labels_k)
## labels_k
## 1 2 3 4 5 6 7 8 9
## 105 210 3 5 98 82 1 67 1
To find the points assigned to a given cluster we can use the which function. It takes a logical vector and returns the indices of the TRUE values.
which(labels_h == 9)
## [1] 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440
## [20] 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459
## [39] 460 461 462 463 464 465 466 467 468 469 470 471 472 474 475 476 477 480 481
## [58] 482 483 484 485 486 487 488 489 490 491 492 494 496 497 498 499 500 501 506
## [77] 510 511 515 516 517 518
To visualize the clusters, we perform PCA on the scaled data and plot the first two principal components.
pc <- prcomp(acids)
yhat <- pc$x
plot(yhat[, 1], yhat[, 2], col = labels_k, pch = labels_k, main = "Olive Oil Data with clusters")
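A two-dimensional PC plot only shows part of the variation, so it is worth checking how much variance the first two components capture. A minimal sketch of the calculation, on synthetic data (for the olive data, apply the same lines to the pc object above, or just call summary(pc)):

```r
# Proportion of variance explained by each principal component:
# squared standard deviations normalized to sum to 1.
X <- scale(matrix(rnorm(100 * 4), ncol = 4))
pc_demo <- prcomp(X)
prop_var <- pc_demo$sdev^2 / sum(pc_demo$sdev^2)
cumsum(prop_var) # cumulative proportion, as reported by summary()
```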
Compare this to the actual regions.
labels_area <- as.integer(factor(data$Area))
plot(yhat[, 1], yhat[, 2], col = labels_area, pch = labels_area, main = "Olive Oil Data with actual regions")
We can see that there is certainly some agreement.
The Rand Index is a measure of the similarity between two clustering solutions. The randIndex function calculates the Adjusted Rand Index by default.
library("flexclust")
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
randIndex(labels_area, labels_k)
## ARI
## 0.7951361
The ARI is about 0.80, which is quite high, indicating that the clustering agrees well with the true areas.
The uncorrected Rand Index is
randIndex(labels_area, labels_k, correct = FALSE)
## RI
## 0.9325009
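The unadjusted Rand Index has a simple definition: the fraction of all pairs of observations on which the two partitions agree (placed together in both, or apart in both). A self-contained sketch of that definition, with rand_index as an illustrative helper (not a flexclust function):

```r
# Rand Index from first principles: agreement over all pairs.
rand_index <- function(a, b) {
  n <- length(a)
  agree <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      same_a <- a[i] == a[j]
      same_b <- b[i] == b[j]
      if (same_a == same_b) agree <- agree + 1
    }
  }
  agree / choose(n, 2)
}
rand_index(c(1, 1, 2, 2), c(1, 1, 2, 2)) # identical partitions: 1
rand_index(c(1, 1, 2, 2), c(1, 2, 1, 2)) # 2 of 6 pairs agree: 1/3
```

This matches randIndex(..., correct = FALSE); the adjusted version additionally corrects for the agreement expected by chance.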
We can also cross tabulate the two clusterings.
table(labels_area, labels_k)
## labels_k
## labels_area 1 2 3 4 5 6 7 8 9
## 1 55 1 0 0 0 0 0 0 0
## 2 0 0 0 0 33 0 0 0 0
## 3 0 0 0 0 0 31 1 17 1
## 4 0 0 0 0 65 0 0 0 0
## 5 25 0 0 0 0 0 0 0 0
## 6 24 6 1 5 0 0 0 0 0
## 7 1 203 2 0 0 0 0 0 0
## 8 0 0 0 0 0 51 0 0 0
## 9 0 0 0 0 0 0 0 50 0
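From a cross-tabulation like this we can also compute each cluster's purity: the fraction of its observations that belong to its most common true area. A self-contained sketch on a toy table (the same two lines apply to table(labels_area, labels_k) above):

```r
# Toy confusion table: 6 observations, 3 true classes, 2 clusters.
tab <- table(
  truth   = c(1, 1, 1, 2, 2, 3),
  cluster = c(1, 1, 2, 2, 2, 2)
)
purity <- apply(tab, 2, max) / colSums(tab)
purity # cluster 1 is pure (1.0); cluster 2 is mixed (0.5)
# Overall purity: majority-labelled observations over n.
sum(apply(tab, 2, max)) / sum(tab) # 2/3
```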
A tanglegram is a visualization of two dendrograms side by side, with lines connecting the same observations in the two trees. We will compare the two clustering solutions from above: one with Euclidean distance and Average linkage, the other with Manhattan distance and Single linkage.
library("dendextend")
##
## ---------------------
## Welcome to dendextend version 1.18.1
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags:
## https://stackoverflow.com/questions/tagged/dendextend
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
library("khroma")
##
## Attaching package: 'khroma'
## The following object is masked from 'package:flexclust':
##
## info
## The following object is masked from 'package:modeltools':
##
## info
mypal <- colour("okabeito")(8)[c(2:8, 1)]
dend1 <- as.dendrogram(clust1)
dend2 <- as.dendrogram(clust2)
dend1 <- set(dend1, "branches_k_color", value = mypal, k = 9)
## Warning in get_col(col, k): Length of color vector was shorter than the number
## of clusters - color vector was recycled
dend2 <- set(dend2, "branches_k_color", value = mypal, k = 9)
## Warning in get_col(col, k): Length of color vector was shorter than the number
## of clusters - color vector was recycled
tanglegram(dend1, dend2, lab.cex = NA, lwd = 0.5, faster = TRUE)
The dendextend package provides many other ways to visualize dendrograms.
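One useful companion to the tanglegram is dendextend's entanglement function, which quantifies how well the leaf orderings of two dendrograms line up: 0 means perfectly aligned, 1 means maximally entangled, and lower values give cleaner tanglegrams. A minimal sketch on a built-in dataset:

```r
library(dendextend)
# Compare average vs single linkage trees on mtcars.
dend_a <- as.dendrogram(hclust(dist(mtcars), method = "average"))
dend_s <- as.dendrogram(hclust(dist(mtcars), method = "single"))
ent <- entanglement(dend_a, dend_s)
ent # a value in [0, 1]; lower is better
```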
mypal2 <- mypal[-c(2, 4, 6)]
dend1 <- as.dendrogram(clust1)
dend1 <- color_branches(
dend = dend1, clusters = labels_k[clust1$order],
col = rep(mypal2, 2)[unique(labels_k[clust1$order])]
)
plot(dend1, leaflab = "none")
We can zoom in on sections of the graph. We do this by iteratively selecting left (1) or right (2) branch at each junction.
plot(dend1[[c(2)]])
plot(dend1[[c(2, 2)]])
plot(dend1[[c(2, 2, 1)]])
plot(dend1[[c(2, 2, 1, 1)]])
Complete and average linkage have a tendency to produce compact, roughly spherical clusters, while single linkage tends to add single points to an existing cluster, making it larger and larger (a behavior known as chaining). This can place very dissimilar observations in the same cluster.
par(mfrow = c(1, 2))
plot(clust1, labels = FALSE, main = "Average linkage")
plot(clust2, labels = FALSE, main = "Single linkage")
labels_a <- cutree(clust1, k = 3)
labels_s <- cutree(clust2, k = 3)
plot(yhat[, 1], yhat[, 2], col = labels_a, pch = labels_a, main = "Average linkage")
plot(yhat[, 1], yhat[, 2], col = labels_s, pch = labels_s, main = "Single linkage")
But this can sometimes be useful. Consider the following example.
theta <- runif(200, -pi, pi)
r <- rnorm(200, rep(c(2, 10), each = 100), sd = 0.5) # 100 points at radius 2, 100 at radius 10
x <- r * cos(theta)
y <- r * sin(theta)
plot(x, y)
Now we will perform hierarchical clustering with single vs average linkage.
xy <- cbind(x, y) # use a new name so we don't overwrite the olive data
clust_a <- hclust(dist(xy), method = "average")
clust_s <- hclust(dist(xy), method = "single")
plot(clust_a, labels = FALSE, main = "Average linkage")
plot(clust_s, labels = FALSE, main = "Single linkage")
labels_a <- cutree(clust_a, k = 2)
labels_s <- cutree(clust_s, k = 2)
plot(x, y, col = labels_a, pch = labels_a, main = "Average linkage")
plot(x, y, col = labels_s, pch = labels_s, main = "Single linkage")
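We can also quantify the difference, since the true ring membership is known from the simulation. Because the labels depend on the random draw above, here is a self-contained rerun of the experiment; acc is an illustrative helper that measures accuracy up to a swap of the two cluster labels:

```r
set.seed(42) # seed is arbitrary, for reproducibility only
theta <- runif(200, -pi, pi)
r <- rnorm(200, rep(c(2, 10), each = 100), sd = 0.5)
xy <- cbind(r * cos(theta), r * sin(theta))
truth <- rep(1:2, each = 100) # inner ring = 1, outer ring = 2
labs_s <- cutree(hclust(dist(xy), method = "single"), k = 2)
labs_a <- cutree(hclust(dist(xy), method = "average"), k = 2)
# Accuracy against the true rings, allowing for label swapping.
acc <- function(l) max(mean(l == truth), mean(l != truth))
c(single = acc(labs_s), average = acc(labs_a))
```

With rings this well separated, single linkage should recover the two rings almost perfectly, while average linkage tends to split the data into two compact halves instead.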