data(iris) # load the iris dataset
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# remove the species column
x <- iris[, -5]
pairs(x)
cov(x)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
## Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
## Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
## Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
cor(x)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
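The correlation matrix is just the covariance matrix rescaled by the standard deviations, \(R = D^{-1/2} S D^{-1/2}\), where \(D\) holds the variances on its diagonal. A small self-contained check:

```r
# correlation = covariance rescaled by the standard deviations
x <- iris[, -5]                   # built-in iris, Species column dropped
S <- cov(x)
D_inv <- diag(1 / sqrt(diag(S)))  # 1/sd_j on the diagonal
stopifnot(all.equal(D_inv %*% S %*% D_inv, cor(x), check.attributes = FALSE))
```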
The `prcomp` function performs the PCA. Its `scale` argument controls whether the variables are standardized first; we will leave it off for now. Let's first try an eigendecomposition of the covariance matrix directly.
eig <- eigen(cov(x))
print(eig$vectors)
## [,1] [,2] [,3] [,4]
## [1,] 0.36138659 -0.65658877 0.58202985 0.3154872
## [2,] -0.08452251 -0.73016143 -0.59791083 -0.3197231
## [3,] 0.85667061 0.17337266 -0.07623608 -0.4798390
## [4,] 0.35828920 0.07548102 -0.54583143 0.7536574
# standard deviations of the principal components = sqrt(eigenvalues)
sqrt(eig$values)
## [1] 2.0562689 0.4926162 0.2796596 0.1543862
pc1 <- prcomp(x)
print(pc1)
## Standard deviations (1, .., p=4):
## [1] 2.0562689 0.4926162 0.2796596 0.1543862
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.36138659 -0.65658877 0.58202985 0.3154872
## Sepal.Width -0.08452251 -0.73016143 -0.59791083 -0.3197231
## Petal.Length 0.85667061 0.17337266 -0.07623608 -0.4798390
## Petal.Width 0.35828920 0.07548102 -0.54583143 0.7536574
summary(pc1)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 2.0563 0.49262 0.2797 0.15439
## Proportion of Variance 0.9246 0.05307 0.0171 0.00521
## Cumulative Proportion 0.9246 0.97769 0.9948 1.00000
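As a sanity check, the eigendecomposition and `prcomp` agree: the standard deviations reported by `prcomp` are the square roots of the eigenvalues of the covariance matrix, and the Proportion of Variance row in the summary is just `sdev^2 / sum(sdev^2)`. A self-contained verification:

```r
x <- iris[, -5]
eig <- eigen(cov(x))   # eigendecomposition of the covariance matrix
pc1 <- prcomp(x)       # unscaled PCA

# sdev of the PCs = square roots of the covariance eigenvalues
stopifnot(all.equal(pc1$sdev, sqrt(eig$values)))

# proportion of variance explained, matching the summary above
print(round(pc1$sdev^2 / sum(pc1$sdev^2), 4))
```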
We can access the components/eigenvectors using `$rotation` and the standard deviations of the principal components using `$sdev`.
round(pc1$rotation, 2)
## PC1 PC2 PC3 PC4
## Sepal.Length 0.36 -0.66 0.58 0.32
## Sepal.Width -0.08 -0.73 -0.60 -0.32
## Petal.Length 0.86 0.17 -0.08 -0.48
## Petal.Width 0.36 0.08 -0.55 0.75
round(pc1$sdev, 2)
## [1] 2.06 0.49 0.28 0.15
It doesn't really make sense to interpret PC3 and PC4: they carry very little of the variance, and it may not even be possible to give them a meaningful interpretation.
Remember that the principal components are linear combinations of the original variables. We can calculate the new data matrix `y` by multiplying the original data matrix `x` by the rotation matrix, whose columns are the eigenvectors. The `predict` function does this, but centers the original data first. Let's do it manually.
head(x)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
# center the data
means <- colMeans(x)
print(means)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
x_centered <- sweep(x, 2, means, "-") # subtract each column mean from every row
head(x_centered)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 -0.7433333 0.44266667 -2.358 -0.9993333
## 2 -0.9433333 -0.05733333 -2.358 -0.9993333
## 3 -1.1433333 0.14266667 -2.458 -0.9993333
## 4 -1.2433333 0.04266667 -2.258 -0.9993333
## 5 -0.8433333 0.54266667 -2.358 -0.9993333
## 6 -0.4433333 0.84266667 -2.058 -0.7993333
print(round(colMeans(x_centered), 3))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0 0 0 0
y <- as.matrix(x_centered) %*% eig$vectors
head(y)
## [,1] [,2] [,3] [,4]
## [1,] -2.684126 -0.3193972 0.02791483 0.002262437
## [2,] -2.714142 0.1770012 0.21046427 0.099026550
## [3,] -2.888991 0.1449494 -0.01790026 0.019968390
## [4,] -2.745343 0.3182990 -0.03155937 -0.075575817
## [5,] -2.728717 -0.3267545 -0.09007924 -0.061258593
## [6,] -2.280860 -0.7413304 -0.16867766 -0.024200858
The `predict` function does the same thing. The result is also automatically stored in `pc1$x` by default when `prcomp` is called.
yhat <- predict(pc1)
head(yhat)
## PC1 PC2 PC3 PC4
## [1,] -2.684126 -0.3193972 0.02791483 0.002262437
## [2,] -2.714142 0.1770012 0.21046427 0.099026550
## [3,] -2.888991 0.1449494 -0.01790026 0.019968390
## [4,] -2.745343 0.3182990 -0.03155937 -0.075575817
## [5,] -2.728717 -0.3267545 -0.09007924 -0.061258593
## [6,] -2.280860 -0.7413304 -0.16867766 -0.024200858
head(pc1$x)
## PC1 PC2 PC3 PC4
## [1,] -2.684126 -0.3193972 0.02791483 0.002262437
## [2,] -2.714142 0.1770012 0.21046427 0.099026550
## [3,] -2.888991 0.1449494 -0.01790026 0.019968390
## [4,] -2.745343 0.3182990 -0.03155937 -0.075575817
## [5,] -2.728717 -0.3267545 -0.09007924 -0.061258593
## [6,] -2.280860 -0.7413304 -0.16867766 -0.024200858
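We can confirm numerically that the manual projection, `predict`, and `pc1$x` all give identical scores; a self-contained sketch:

```r
x <- as.matrix(iris[, -5])
pc1 <- prcomp(x)

# manual route: center the columns, then multiply by the rotation matrix
x_centered <- scale(x, center = TRUE, scale = FALSE)
y <- x_centered %*% pc1$rotation

# all three routes agree up to numerical tolerance
stopifnot(all.equal(y, predict(pc1), check.attributes = FALSE))
stopifnot(all.equal(predict(pc1), pc1$x, check.attributes = FALSE))
```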
We can change the default colours by passing a new set of colours to the `palette` function. This is useful for choosing colour-blind friendly palettes.
library(khroma)
mypal <- colour("okabeito")(8)
mypal <- mypal[c(2:8, 1)]
names(mypal) <- NULL
palette(mypal)
After running `par(mfrow = c(a, b))`, any plots we construct will be arranged by rows in an array of `a` rows and `b` columns. `mfcol` arranges plots by column instead. The `par()` function sets these and other global graphical parameters.
abbrev <- c("SL", "SW", "PL", "PW")
par(mfrow = c(2, 2))
for (i in 1:ncol(pc1$rotation)) {
barplot(pc1$rotation[, i],
names.arg = abbrev, ylim = c(-1, 1), col = as.factor(abbrev),
main = paste("PC", i, sep = "")
)
}
Where possible, we can use symbols in addition to colours to distinguish between the classes. This is done with the `pch` argument.
pairs(iris[, 1:4],
col = iris[, 5], pch = as.integer(iris[, 5]),
main = "Pairs Plot for the Iris Dataset"
)
pairs(pc1$x,
col = iris[, 5], pch = as.integer(iris[, 5]),
main = "Pairs Plot for the Principal Components of Iris"
)
We can use a box-and-whisker plot to see how well the leading PC separates the classes. The box is drawn from the first quartile \(Q_1\) to the third quartile \(Q_3\), and a horizontal line through the box marks the median. The whiskers extend to the most extreme data points within \(Q_1 - 1.5\,\mathrm{IQR}\) and \(Q_3 + 1.5\,\mathrm{IQR}\), where \(\mathrm{IQR} = Q_3 - Q_1\) is the interquartile range. Points outside this range are considered outliers and are plotted as individual points.
boxplot(yhat[, 1] ~ iris[, 5])
We can see that Setosa is well separated from the other two species, but there is some overlap between Versicolor and Virginica. We can also see this from the pairs plot.
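The whisker fences described above can be computed directly. (A small sketch; note that `boxplot` itself computes hinges via `fivenum`, which can differ slightly from `quantile` on small samples.)

```r
pc1_scores <- prcomp(iris[, -5])$x[, 1]   # scores on the leading PC
q <- quantile(pc1_scores, c(0.25, 0.75))  # Q1 and Q3
iqr <- unname(q[2] - q[1])                # interquartile range
fences <- c(lower = unname(q[1]) - 1.5 * iqr,
            upper = unname(q[2]) + 1.5 * iqr)
print(fences)
# points outside the fences are the ones drawn individually
sum(pc1_scores < fences["lower"] | pc1_scores > fences["upper"])
```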
plot(yhat[, 1], yhat[, 2],
type = "n", xlab = "PC1", ylab = "PC2",
main = "Scatter Plot of the First Two Principal Components from Iris"
)
text(yhat[, 1], yhat[, 2], labels = substr(iris[, 5], 1, 2), col = as.integer(iris[, 5]))
Recall \(\mathrm{Var}[aX] = a^2\,\mathrm{Var}[X]\).
Units matter! If one variable is measured in cm and another in mm, their variances are not comparable: changing a variable from cm to mm multiplies its values by 10, but its variance by 100.
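A one-line numerical illustration using the sepal lengths:

```r
sl_cm <- iris$Sepal.Length   # measured in cm
sl_mm <- 10 * sl_cm          # the same lengths in mm
# multiplying the values by 10 multiplies the variance by 100
stopifnot(all.equal(var(sl_mm), 100 * var(sl_cm)))
```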
How do we make the data unitless? Standardize each variable:
\[z_{ij} = (x_{ij} - \mu_j)/\sigma_j\]
The covariance matrix of the standardized data is then exactly the correlation matrix of the original variables.
eigen(cor(iris[, 1:4]))
## eigen() decomposition
## $values
## [1] 2.91849782 0.91403047 0.14675688 0.02071484
##
## $vectors
## [,1] [,2] [,3] [,4]
## [1,] 0.5210659 -0.37741762 0.7195664 0.2612863
## [2,] -0.2693474 -0.92329566 -0.2443818 -0.1235096
## [3,] 0.5804131 -0.02449161 -0.1421264 -0.8014492
## [4,] 0.5648565 -0.06694199 -0.6342727 0.5235971
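The covariance matrix of the standardized variables equals the correlation matrix of the originals, which we can check directly:

```r
x <- iris[, -5]
z <- scale(x)   # center each column and divide by its standard deviation
stopifnot(all.equal(cov(z), cor(x), check.attributes = FALSE))
```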
We can do this with `prcomp` by setting the `scale` argument to `TRUE`.
pc2 <- prcomp(x, scale = TRUE)
summary(pc2)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
print(pc2)
## Standard deviations (1, .., p=4):
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
## Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
The interpretation is very similar, but the coefficients are more balanced.
The scree plot shows the proportion of variance explained by each component. It is useful for deciding how many components to keep.
plot(
pc2$sdev^2 / sum(pc2$sdev^2),
type = "b", main = "PCA of the scaled Iris Dataset",
xlab = "Component", ylab = "Proportion of Variance Explained"
)
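A common rule of thumb is to keep the smallest number of components whose cumulative proportion of variance passes some threshold (90% here, chosen purely for illustration):

```r
pc2 <- prcomp(iris[, -5], scale = TRUE)
prop_var <- pc2$sdev^2 / sum(pc2$sdev^2)
cum_var <- cumsum(prop_var)
print(round(cum_var, 4))
# smallest k with at least 90% of the variance explained
k <- which(cum_var >= 0.90)[1]
print(k)   # 2 components suffice for the scaled iris data
```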
The `GGally` package contains a function called `ggpairs` which produces a richer version of the standard `pairs` plot. The lower triangle is the same as a basic pairs plot, the diagonal displays marginal density plots of every variable for each group, and the upper triangle provides the overall and group-specific correlations for each pair of variables.
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs_iris <- ggpairs(
data = iris, mapping = aes(colour = Species, shape = Species), columns = 1:4,
title = "Pairs Plot for the Iris Dataset"
)
for (i in 1:4) {
for (j in 1:4) {
ggpairs_iris[i, j] <- (ggpairs_iris[i, j] + scale_colour_manual(values = mypal)
+ scale_fill_manual(values = mypal)
)
}
}
ggpairs_iris
A biplot is a scatterplot where each data point is represented by its scores along the first two principal components. The original variables are overlaid as arrows: the direction of each arrow shows how that variable loads on the two displayed components, and its length reflects how strongly the variable contributes to them.
biplot(pc2)