Hierarchical clustering is an unsupervised machine learning method for grouping unlabeled data points. Like K-means clustering, it groups together data points with similar characteristics, but instead of a single partition it builds a whole hierarchy of nested clusters that we can cut at any level.
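Before turning to real data, here is a rough sketch of the basic workflow in R (the toy matrix and the choice of three clusters are made up purely for illustration): compute pairwise distances, build the tree, then cut it.
# Minimal sketch on made-up data: distances -> tree -> cluster labels
set.seed(1)
toy <- matrix(rnorm(20), ncol = 2)   # 10 observations, 2 variables
d <- dist(toy)                       # pairwise Euclidean distances
hc <- hclust(d, method = "average")  # agglomerative clustering, average linkage
cutree(hc, k = 3)                    # cut the tree into 3 clusters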
# Load the data
data <- read.csv("datasets/olive.csv")
dim(data)
## [1] 572  10
head(data)
##   Region         Area palmitic palmitoleic stearic oleic linoleic linolenic
## 1  South North Apulia     1075          75     226  7823      672        36
## 2  South North Apulia     1088          73     224  7709      781        31
## 3  South North Apulia      911          54     246  8113      549        31
## 4  South North Apulia      966          57     240  7952      619        50
## 5  South North Apulia     1051          67     259  7771      672        50
## 6  South North Apulia      911          49     268  7924      678        51
##   arachidic eicosenoic
## 1        60         29
## 2        61         29
## 3        63         29
## 4        78         35
## 5        80         46
## 6        70         44
The olive oil data were collected from nine areas of Italy, grouped into three broader regions (Southern Italy, Sardinia and Northern Italy). The first two columns give the region and area of origin; the remaining columns are the acids present in the oil. We care only about the acids.
acids <- data[, -c(1, 2)]
head(acids)
##   palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
## 1     1075          75     226  7823      672        36        60         29
## 2     1088          73     224  7709      781        31        61         29
## 3      911          54     246  8113      549        31        63         29
## 4      966          57     240  7952      619        50        78         35
## 5     1051          67     259  7771      672        50        80         46
## 6      911          49     268  7924      678        51        70         44
Before computing distances we standardize the acids so that each variable has mean 0 and standard deviation 1; otherwise the acids measured on larger scales would dominate the distances.
acids <- scale(acids, center = TRUE, scale = TRUE)
acids_dis <- dist(acids, method = "euclidean")
acids_dis_mat <- as.matrix(acids_dis)
Now we can inspect the distances between observations.
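As a quick sanity check (a sketch, not part of the original analysis), any entry of this matrix can be reproduced by hand from the scaled data, for example the distance between observations 1 and 2:
# Euclidean distance between observations 1 and 2, computed directly;
# this should equal acids_dis_mat[1, 2]
sqrt(sum((acids[1, ] - acids[2, ])^2))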
Euclidean distance results:
acids_dis_mat[1:5, 1:5]
##           1         2        3        4        5
## 1 0.0000000 0.6644688 1.528744 1.686387 2.076803
## 2 0.6644688 0.0000000 1.871914 2.129154 2.353875
## 3 1.5287437 1.8719138 0.000000 1.780000 2.458035
## 4 1.6863871 2.1291537 1.780000 0.000000 1.192788
## 5 2.0768026 2.3538752 2.458035 1.192788 0.000000
Manhattan distance results:
manh_dis <- dist(acids, method = "manhattan")
manh_mat <- as.matrix(manh_dis)
manh_mat[1:5, 1:5]
##          1        2        3        4        5
## 1 0.000000 1.330424 3.660028 4.229217 4.515443
## 2 1.330424 0.000000 4.052381 5.392657 5.422607
## 3 3.660028 4.052381 0.000000 3.803694 6.225028
## 4 4.229217 5.392657 3.803694 0.000000 2.747910
## 5 4.515443 5.422607 6.225028 2.747910 0.000000
We now perform agglomerative hierarchical clustering, first with Euclidean distance and average linkage, then with Manhattan distance and single linkage.
clust1 <- hclust(acids_dis, method = "average")
plot(clust1, labels = FALSE, main = "Euclidean distance - Average linkage")
clust2 <- hclust(manh_dis, method = "single")
plot(clust2, labels = FALSE, main = "Manhattan distance - Single linkage")
The object clust1 contains the hierarchical clustering information. For example:

- $merge records the order in which clusters were merged; negative entries refer to single observations, positive entries to clusters formed at earlier steps.
- $height contains the heights at which the clusters were merged.
- $order contains the order of the observations in the dendrogram.

head(clust1$merge)
##      [,1] [,2]
## [1,] -548 -553
## [2,] -450 -451
## [3,] -551    1
## [4,] -439 -441
## [5,] -549 -550
## [6,] -436 -440
head(clust1$height)
## [1] 0.06422800 0.07100611 0.09037745 0.09198610 0.11046836 0.14337576
head(clust1$order)
## [1] 522 527 551 548 553 537
We can cut the tree at a given height to obtain a clustering. A “recommended” height is \(\bar h + 3 s_h\), where \(\bar h\) is the mean height at which groups are joined and \(s_h\) is the standard deviation of these heights.
h <- mean(clust1$height) + 3 * sd(clust1$height)
plot(clust1, labels = FALSE)
abline(h = h, lty = 2, col = 2) # adds a line to an existing plot
labels_h <- cutree(clust1, h = h)
table(labels_h)
## labels_h
##   1   2   3   4   5   6   7   8   9  10  11  12  13 
##  30  74 197   1   3   5  13  98  82   1  54   1  13
We can also cut the tree into a given number of clusters. We know that there are nine areas, so we can cut the tree into 9 clusters.
labels_k <- cutree(clust1, k = 9)
table(labels_k)
## labels_k
##   1   2   3   4   5   6   7   8   9 
## 105 210   3   5  98  82   1  67   1
To find the points assigned to a given cluster we can use the
which function. It takes in a logical vector and returns
the indices of the TRUE values.
which(labels_h == 9)
##  [1] 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440
## [20] 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459
## [39] 460 461 462 463 464 465 466 467 468 469 470 471 472 474 475 476 477 480 481
## [58] 482 483 484 485 486 487 488 489 490 491 492 494 496 497 498 499 500 501 506
## [77] 510 511 515 516 517 518
To visualize the clusters, we first perform PCA on the scaled data and plot the first two principal components.
pc <- prcomp(acids)
yhat <- pc$x
plot(yhat[, 1], yhat[, 2], col = labels_k, pch = labels_k, main = "Olive Oil Data with clusters")
Compare this to the actual areas of origin.
labels_area <- as.integer(factor(data$Area))
plot(yhat[, 1], yhat[, 2], col = labels_area, pch = labels_area, main = "Olive Oil Data with actual areas")
We can see that there is certainly some agreement.
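How faithful this two-dimensional picture is depends on how much of the variance the first two components capture; a quick check (not part of the original analysis):
# Proportion of variance explained by each principal component;
# the cumulative proportions are also reported by summary(pc)
pc$sdev^2 / sum(pc$sdev^2)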
The Rand Index is a measure of the similarity between two clustering solutions. The randIndex function from the flexclust package calculates the Adjusted Rand Index (ARI) by default.
library("flexclust")## Loading required package: grid## Loading required package: lattice## Loading required package: modeltools## Loading required package: stats4randIndex(labels_area, labels_k)##       ARI 
## 0.7951361
The ARI is 0.79, which is quite high, indicating that the clustering agrees well with the true areas.
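For reference, the unadjusted Rand Index is simply the proportion of pairs of observations on which the two partitions agree (both put the pair together, or both keep it apart). A minimal sketch of that computation, which should match the flexclust result below:
# All pairs of observations, and whether each partition puts the pair together
pairs <- combn(length(labels_k), 2)
same_k <- labels_k[pairs[1, ]] == labels_k[pairs[2, ]]
same_a <- labels_area[pairs[1, ]] == labels_area[pairs[2, ]]
mean(same_k == same_a)  # proportion of agreeing pairs = unadjusted Rand Index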
The uncorrected Rand Index is
randIndex(labels_area, labels_k, correct = FALSE)
##        RI
## 0.9325009
We can also cross-tabulate the two clusterings.
table(labels_area, labels_k)
##            labels_k
## labels_area   1   2   3   4   5   6   7   8   9
##           1  55   1   0   0   0   0   0   0   0
##           2   0   0   0   0  33   0   0   0   0
##           3   0   0   0   0   0  31   1  17   1
##           4   0   0   0   0  65   0   0   0   0
##           5  25   0   0   0   0   0   0   0   0
##           6  24   6   1   5   0   0   0   0   0
##           7   1 203   2   0   0   0   0   0   0
##           8   0   0   0   0   0  51   0   0   0
##           9   0   0   0   0   0   0   0  50   0
A tanglegram is a visualization of two dendrograms side by side, with lines connecting the same observations in the two trees. We will compare the two clustering solutions, one with Euclidean distance and average linkage, and the other with Manhattan distance and single linkage.
library("dendextend")## 
## ---------------------
## Welcome to dendextend version 1.18.1
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags: 
##   https://stackoverflow.com/questions/tagged/dendextend
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------## 
## Attaching package: 'dendextend'## The following object is masked from 'package:stats':
## 
##     cutreelibrary("khroma")## 
## Attaching package: 'khroma'## The following object is masked from 'package:flexclust':
## 
##     info## The following object is masked from 'package:modeltools':
## 
##     infomypal <- colour("okabeito")(8)[c(2:8, 1)]
dend1 <- as.dendrogram(clust1)
dend2 <- as.dendrogram(clust2)
dend1 <- set(dend1, "branches_k_color", value = mypal, k = 9)
## Warning in get_col(col, k): Length of color vector was shorter than the number
## of clusters - color vector was recycled
dend2 <- set(dend2, "branches_k_color", value = mypal, k = 9)
## Warning in get_col(col, k): Length of color vector was shorter than the number
## of clusters - color vector was recycled
tanglegram(dend1, dend2, lab.cex = NA, lwd = 0.5, faster = TRUE)
The dendextend package provides many other ways to
visualize dendrograms.
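One quantitative companion to the tanglegram (not used in the original analysis) is dendextend's entanglement function, which measures how crossed the connecting lines are on a 0 to 1 scale, with lower values indicating better-aligned trees:
# Entanglement of the two dendrograms: 0 means perfectly aligned leaves,
# 1 means maximally crossed connecting lines
entanglement(dend1, dend2)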
mypal2 <- mypal[-c(2, 4, 6)]
dend1 <- as.dendrogram(clust1)
dend1 <- color_branches(
  dend = dend1, clusters = labels_k[clust1$order],
  col = rep(mypal2, 2)[unique(labels_k[clust1$order])]
)
plot(dend1, leaflab = "none")
We can zoom in on sections of the dendrogram by iteratively selecting the left (1) or right (2) branch at each junction.
plot(dend1[[c(2)]])
plot(dend1[[c(2, 2)]])
plot(dend1[[c(2, 2, 1)]])
plot(dend1[[c(2, 2, 1, 1)]])
Complete and average linkage tend to produce compact, roughly spherical clusters, while single linkage tends to add single points to an existing cluster, making it grow larger and larger ("chaining"). This can result in very dissimilar elements ending up in the same cluster.
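As a reminder of what the linkages actually compute (a sketch, not part of the original code): single linkage measures the distance between two clusters by their closest pair of points, complete linkage by their farthest pair, and average linkage by the mean of all pairwise distances. The helper below is hypothetical, defined only for illustration:
# Hypothetical helper: between-cluster distance under single, complete and
# average linkage, computed from the raw pairwise distances between two groups
linkage_dist <- function(X, i, j) {
  d <- as.matrix(dist(X))[i, j, drop = FALSE]
  c(single = min(d), complete = max(d), average = mean(d))
}
linkage_dist(acids, which(labels_k == 1), which(labels_k == 2))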
par(mfrow = c(1, 2))
plot(clust1, labels = FALSE, main = "Average linkage")
plot(clust2, labels = FALSE, main = "Single linkage")
labels_a <- cutree(clust1, k = 3)
labels_s <- cutree(clust2, k = 3)
plot(yhat[, 1], yhat[, 2], col = labels_a, pch = labels_a, main = "Average linkage")
plot(yhat[, 1], yhat[, 2], col = labels_s, pch = labels_s, main = "Single linkage")
But this chaining behaviour can sometimes be useful. Consider the following example.
theta <- runif(200, -pi, pi)                            # random angles
r <- rnorm(200, rep(c(2, 10), c(100, 100)), sd = 0.5)   # 100 radii around 2, 100 around 10
x <- r * cos(theta)
y <- r * sin(theta)
plot(x, y)
Now we will perform hierarchical clustering with single vs average linkage.
data <- cbind(x, y)
clust_a <- hclust(dist(data), method = "average")
clust_s <- hclust(dist(data), method = "single")
plot(clust_a, labels = FALSE, main = "Average linkage")
plot(clust_s, labels = FALSE, main = "Single linkage")
labels_a <- cutree(clust_a, k = 2)
labels_s <- cutree(clust_s, k = 2)
plot(x, y, col = labels_a, pch = labels_a, main = "Average linkage")
plot(x, y, col = labels_s, pch = labels_s, main = "Single linkage")
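To quantify how differently the two linkages behave here, we can compare each clustering to the ring each point was simulated from (a small follow-up sketch; the ring labels below come from the simulation above):
# True ring membership from the simulation: first 100 points near radius 2,
# the remaining 100 near radius 10
truth <- rep(c(1, 2), c(100, 100))
table(truth, labels_s)  # single linkage typically separates the two rings cleanly
table(truth, labels_a)  # average linkage typically cuts across the outer ring instead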