Hierarchical clustering is an unsupervised machine learning algorithm used to cluster unlabeled data points. Like K-means clustering, it groups together data points with similar characteristics, but instead of a single partition it produces a whole hierarchy of nested clusters.
# Load the data
data <- read.csv("datasets/olive.csv")
dim(data)
## [1] 572 10
head(data)
## Region Area palmitic palmitoleic stearic oleic linoleic linolenic
## 1 South North Apulia 1075 75 226 7823 672 36
## 2 South North Apulia 1088 73 224 7709 781 31
## 3 South North Apulia 911 54 246 8113 549 31
## 4 South North Apulia 966 57 240 7952 619 50
## 5 South North Apulia 1051 67 259 7771 672 50
## 6 South North Apulia 911 49 268 7924 678 51
## arachidic eicosenoic
## 1 60 29
## 2 61 29
## 3 63 29
## 4 78 35
## 5 80 46
## 6 70 44
The olive oil data were collected from 9 different areas across Southern Italy, Northern Italy and Sicily. The first two columns give the region and area of origin; the rest are the acids present in the oil. We care only about the acids.
acids <- data[, -c(1, 2)]
head(acids)
## palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
## 1 1075 75 226 7823 672 36 60 29
## 2 1088 73 224 7709 781 31 61 29
## 3 911 54 246 8113 549 31 63 29
## 4 966 57 240 7952 619 50 78 35
## 5 1051 67 259 7771 672 50 80 46
## 6 911 49 268 7924 678 51 70 44
acids <- scale(acids, center = TRUE, scale = TRUE)
acids_dis <- dist(acids, method = "euclidean")
acids_dis_mat <- as.matrix(acids_dis)
Now we can check for distances between observations.
Euclidean distance results:
acids_dis_mat[1:5, 1:5]
## 1 2 3 4 5
## 1 0.0000000 0.6644688 1.528744 1.686387 2.076803
## 2 0.6644688 0.0000000 1.871914 2.129154 2.353875
## 3 1.5287437 1.8719138 0.000000 1.780000 2.458035
## 4 1.6863871 2.1291537 1.780000 0.000000 1.192788
## 5 2.0768026 2.3538752 2.458035 1.192788 0.000000
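As a quick sanity check on what dist() computes, here is a small self-contained toy example (separate from the olive data) confirming that the Euclidean entry is just the usual square-root-of-squared-differences formula:

```r
# Two points: (0, 0) and (3, 4). Their Euclidean distance is
# sqrt(3^2 + 4^2) = 5, and dist() should agree.
m <- matrix(c(0, 0, 3, 4), nrow = 2, byrow = TRUE)
d <- as.matrix(dist(m, method = "euclidean"))
d[1, 2] # 5
```

The same check works on the scaled olive data, e.g. comparing sqrt(sum((acids[1, ] - acids[2, ])^2)) against acids_dis_mat[1, 2].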
Manhattan distance results:
manh_dis <- dist(acids, method = "manhattan")
manh_mat <- as.matrix(manh_dis)
manh_mat[1:5, 1:5]
## 1 2 3 4 5
## 1 0.000000 1.330424 3.660028 4.229217 4.515443
## 2 1.330424 0.000000 4.052381 5.392657 5.422607
## 3 3.660028 4.052381 0.000000 3.803694 6.225028
## 4 4.229217 5.392657 3.803694 0.000000 2.747910
## 5 4.515443 5.422607 6.225028 2.747910 0.000000
clust1 <- hclust(acids_dis, method = "average")
plot(clust1, labels = FALSE, main = "Euclidean distance - Average linkage")
clust2 <- hclust(manh_dis, method = "single")
plot(clust2, labels = FALSE, main = "Manhattan distance - Single linkage")
The object clust1 contains the hierarchical clustering information. For example:
- $merge describes the order in which clusters were merged
- $height contains the height at which the clusters were merged
- $order contains the order of the observations in the dendrogram
head(clust1$merge)
## [,1] [,2]
## [1,] -548 -553
## [2,] -450 -451
## [3,] -551 1
## [4,] -439 -441
## [5,] -549 -550
## [6,] -436 -440
head(clust1$height)
## [1] 0.06422800 0.07100611 0.09037745 0.09198610 0.11046836 0.14337576
head(clust1$order)
## [1] 522 527 551 548 553 537
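In $merge, a negative entry refers to a single observation and a positive entry refers to the cluster formed at that earlier step; so row 3 above says observation 551 was joined to the cluster created at step 1 (observations 548 and 553). A small self-contained example makes the convention visible:

```r
# Four points on a line: two tight pairs far apart.
hc <- hclust(dist(c(0, 1, 10, 11)), method = "average")
hc$merge
# The first two rows merge singletons (all entries negative);
# the last row joins the clusters formed at steps 1 and 2
# (both entries positive).
```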
We can cut the tree at a certain height to get a certain number of clusters. A “recommended” height is \(\bar h+3s_h\) where \(\bar h\) is the mean height at which groups are joined, and \(s_h\) is the standard deviation of these heights.
h <- mean(clust1$height) + 3 * sd(clust1$height)
plot(clust1, labels = FALSE)
abline(h = h, lty = 2, col = 2) # adds line to an already existing plot
labels_h <- cutree(clust1, h = h)
table(labels_h)
## labels_h
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 30 74 197 1 3 5 13 98 82 1 54 1 13
We can also cut the tree at a certain number of clusters. We know that there are 9 areas, so we can cut the tree at 9 clusters.
labels_k <- cutree(clust1, k = 9)
table(labels_k)
## labels_k
## 1 2 3 4 5 6 7 8 9
## 105 210 3 5 98 82 1 67 1
To find the points assigned to a given cluster we can use the which function. It takes a logical vector and returns the indices of the TRUE values.
which(labels_h == 9)
## [1] 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440
## [20] 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459
## [39] 460 461 462 463 464 465 466 467 468 469 470 471 472 474 475 476 477 480 481
## [58] 482 483 484 485 486 487 488 489 490 491 492 494 496 497 498 499 500 501 506
## [77] 510 511 515 516 517 518
To visualize the clusters, we perform PCA on the scaled data and plot the first two principal components.
pc <- prcomp(acids)
yhat <- pc$x
plot(yhat[, 1], yhat[, 2], col = labels_k, pch = labels_k, main = "Olive Oil Data with clusters")
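A two-dimensional PC plot only shows part of the variation, so it is worth checking how much variance the first two components capture. A minimal sketch of the calculation, on synthetic data (for the olive data, apply the same lines to the pc object above, or just call summary(pc)):

```r
# Proportion of variance explained by each principal component:
# squared standard deviations normalized to sum to 1.
X <- scale(matrix(rnorm(100 * 4), ncol = 4))
pc_demo <- prcomp(X)
prop_var <- pc_demo$sdev^2 / sum(pc_demo$sdev^2)
cumsum(prop_var) # cumulative proportion, as reported by summary()
```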
Compare this to the actual regions.
labels_area <- as.integer(factor(data$Area))
plot(yhat[, 1], yhat[, 2], col = labels_area, pch = labels_area, main = "Olive Oil Data with actual regions")
We can see that there is certainly some agreement.
The Rand Index is a measure of the similarity between two clustering solutions. The randIndex function calculates the Adjusted Rand Index by default.
library("flexclust")
## Loading required package: grid
## Loading required package: lattice
## Loading required package: modeltools
## Loading required package: stats4
randIndex(labels_area, labels_k)
## ARI
## 0.7951361
The ARI is about 0.80, which is quite high, indicating that the clustering agrees well with the true areas.
The uncorrected Rand Index is
randIndex(labels_area, labels_k, correct = FALSE)
## RI
## 0.9325009
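The unadjusted Rand Index has a simple definition: the fraction of all pairs of observations on which the two partitions agree (placed together in both, or apart in both). A self-contained sketch of that definition, with rand_index as an illustrative helper (not a flexclust function):

```r
# Rand Index from first principles: agreement over all pairs.
rand_index <- function(a, b) {
  n <- length(a)
  agree <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      same_a <- a[i] == a[j]
      same_b <- b[i] == b[j]
      if (same_a == same_b) agree <- agree + 1
    }
  }
  agree / choose(n, 2)
}
rand_index(c(1, 1, 2, 2), c(1, 1, 2, 2)) # identical partitions: 1
rand_index(c(1, 1, 2, 2), c(1, 2, 1, 2)) # 2 of 6 pairs agree: 1/3
```

This matches randIndex(..., correct = FALSE); the adjusted version additionally corrects for the agreement expected by chance.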
We can also cross tabulate the two clusterings.
table(labels_area, labels_k)
## labels_k
## labels_area 1 2 3 4 5 6 7 8 9
## 1 55 1 0 0 0 0 0 0 0
## 2 0 0 0 0 33 0 0 0 0
## 3 0 0 0 0 0 31 1 17 1
## 4 0 0 0 0 65 0 0 0 0
## 5 25 0 0 0 0 0 0 0 0
## 6 24 6 1 5 0 0 0 0 0
## 7 1 203 2 0 0 0 0 0 0
## 8 0 0 0 0 0 51 0 0 0
## 9 0 0 0 0 0 0 0 50 0
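From a cross-tabulation like this we can also compute each cluster's purity: the fraction of its observations that belong to its most common true area. A self-contained sketch on a toy table (the same two lines apply to table(labels_area, labels_k) above):

```r
# Toy confusion table: 6 observations, 3 true classes, 2 clusters.
tab <- table(
  truth   = c(1, 1, 1, 2, 2, 3),
  cluster = c(1, 1, 2, 2, 2, 2)
)
purity <- apply(tab, 2, max) / colSums(tab)
purity # cluster 1 is pure (1.0); cluster 2 is mixed (0.5)
# Overall purity: majority-labelled observations over n.
sum(apply(tab, 2, max)) / sum(tab) # 2/3
```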
A tanglegram is a visualization of two dendrograms side by side, with lines connecting the same observations in the two trees. We will compare the two clustering solutions from above: one with Euclidean distance and Average linkage, the other with Manhattan distance and Single linkage.
library("dendextend")
##
## ---------------------
## Welcome to dendextend version 1.18.1
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags:
## https://stackoverflow.com/questions/tagged/dendextend
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
library("khroma")
##
## Attaching package: 'khroma'
## The following object is masked from 'package:flexclust':
##
## info
## The following object is masked from 'package:modeltools':
##
## info
mypal <- colour("okabeito")(8)[c(2:8, 1)]
dend1 <- as.dendrogram(clust1)
dend2 <- as.dendrogram(clust2)
dend1 <- set(dend1, "branches_k_color", value = mypal, k = 9)
## Warning in get_col(col, k): Length of color vector was shorter than the number
## of clusters - color vector was recycled
dend2 <- set(dend2, "branches_k_color", value = mypal, k = 9)
## Warning in get_col(col, k): Length of color vector was shorter than the number
## of clusters - color vector was recycled
tanglegram(dend1, dend2, lab.cex = NA, lwd = 0.5, faster = TRUE)
The dendextend package provides many other ways to visualize dendrograms.
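One useful companion to the tanglegram is dendextend's entanglement function, which quantifies how well the leaf orderings of two dendrograms line up: 0 means perfectly aligned, 1 means maximally entangled, and lower values give cleaner tanglegrams. A minimal sketch on a built-in dataset:

```r
library(dendextend)
# Compare average vs single linkage trees on mtcars.
dend_a <- as.dendrogram(hclust(dist(mtcars), method = "average"))
dend_s <- as.dendrogram(hclust(dist(mtcars), method = "single"))
ent <- entanglement(dend_a, dend_s)
ent # a value in [0, 1]; lower is better
```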
mypal2 <- mypal[-c(2, 4, 6)]
dend1 <- as.dendrogram(clust1)
dend1 <- color_branches(
dend = dend1, clusters = labels_k[clust1$order],
col = rep(mypal2, 2)[unique(labels_k[clust1$order])]
)
plot(dend1, leaflab = "none")
We can zoom in on sections of the graph. We do this by iteratively selecting left (1) or right (2) branch at each junction.
plot(dend1[[c(2)]])
plot(dend1[[c(2, 2)]])
plot(dend1[[c(2, 2, 1)]])
plot(dend1[[c(2, 2, 1, 1)]])
Complete and average linkage have a tendency to produce compact, roughly spherical clusters, while single linkage tends to add single points to an existing cluster, making it larger and larger (a behavior known as chaining). This can place very dissimilar observations in the same cluster.
par(mfrow = c(1, 2))
plot(clust1, labels = FALSE, main = "Average linkage")
plot(clust2, labels = FALSE, main = "Single linkage")
labels_a <- cutree(clust1, k = 3)
labels_s <- cutree(clust2, k = 3)
plot(yhat[, 1], yhat[, 2], col = labels_a, pch = labels_a, main = "Average linkage")
plot(yhat[, 1], yhat[, 2], col = labels_s, pch = labels_s, main = "Single linkage")
But this can sometimes be useful. Consider the following example.
theta <- runif(200, -pi, pi)
r <- rnorm(200, rep(c(2, 10), each = 100), sd = 0.5) # 100 points at radius 2, 100 at radius 10
x <- r * cos(theta)
y <- r * sin(theta)
plot(x, y)
Now we will perform hierarchical clustering with single vs average linkage.
xy <- cbind(x, y) # use a new name so we don't overwrite the olive data
clust_a <- hclust(dist(xy), method = "average")
clust_s <- hclust(dist(xy), method = "single")
plot(clust_a, labels = FALSE, main = "Average linkage")
plot(clust_s, labels = FALSE, main = "Single linkage")
labels_a <- cutree(clust_a, k = 2)
labels_s <- cutree(clust_s, k = 2)
plot(x, y, col = labels_a, pch = labels_a, main = "Average linkage")
plot(x, y, col = labels_s, pch = labels_s, main = "Single linkage")
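We can also quantify the difference, since the true ring membership is known from the simulation. Because the labels depend on the random draw above, here is a self-contained rerun of the experiment; acc is an illustrative helper that measures accuracy up to a swap of the two cluster labels:

```r
set.seed(42) # seed is arbitrary, for reproducibility only
theta <- runif(200, -pi, pi)
r <- rnorm(200, rep(c(2, 10), each = 100), sd = 0.5)
xy <- cbind(r * cos(theta), r * sin(theta))
truth <- rep(1:2, each = 100) # inner ring = 1, outer ring = 2
labs_s <- cutree(hclust(dist(xy), method = "single"), k = 2)
labs_a <- cutree(hclust(dist(xy), method = "average"), k = 2)
# Accuracy against the true rings, allowing for label swapping.
acc <- function(l) max(mean(l == truth), mean(l != truth))
c(single = acc(labs_s), average = acc(labs_a))
```

With rings this well separated, single linkage should recover the two rings almost perfectly, while average linkage tends to split the data into two compact halves instead.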