data(iris) # load the iris dataset
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# remove the species column
x <- iris[, -5]
pairs(x)
cov(x)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 0.6856935 -0.0424340 1.2743154 0.5162707
## Sepal.Width -0.0424340 0.1899794 -0.3296564 -0.1216394
## Petal.Length 1.2743154 -0.3296564 3.1162779 1.2956094
## Petal.Width 0.5162707 -0.1216394 1.2956094 0.5810063
cor(x)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
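The correlation matrix is just the covariance matrix rescaled by the standard deviations, \(R = D^{-1/2} S D^{-1/2}\), where \(D\) holds the variances on its diagonal. A small self-contained check:

```r
# correlation = covariance rescaled by the standard deviations
x <- iris[, -5]                   # built-in iris, Species column dropped
S <- cov(x)
D_inv <- diag(1 / sqrt(diag(S)))  # 1/sd_j on the diagonal
stopifnot(all.equal(D_inv %*% S %*% D_inv, cor(x), check.attributes = FALSE))
```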
The `prcomp` function performs the PCA. Its `scale` argument controls whether the variables are standardized first; we will leave it off for now. Let's first try an eigendecomposition of the covariance matrix directly.
eig <- eigen(cov(x))
print(eig$vectors)
## [,1] [,2] [,3] [,4]
## [1,] 0.36138659 -0.65658877 0.58202985 0.3154872
## [2,] -0.08452251 -0.73016143 -0.59791083 -0.3197231
## [3,] 0.85667061 0.17337266 -0.07623608 -0.4798390
## [4,] 0.35828920 0.07548102 -0.54583143 0.7536574
# standard deviations of the principal components = sqrt(eigenvalues)
sqrt(eig$values)
## [1] 2.0562689 0.4926162 0.2796596 0.1543862
pc1 <- prcomp(x)
print(pc1)
## Standard deviations (1, .., p=4):
## [1] 2.0562689 0.4926162 0.2796596 0.1543862
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.36138659 -0.65658877 0.58202985 0.3154872
## Sepal.Width -0.08452251 -0.73016143 -0.59791083 -0.3197231
## Petal.Length 0.85667061 0.17337266 -0.07623608 -0.4798390
## Petal.Width 0.35828920 0.07548102 -0.54583143 0.7536574
summary(pc1)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 2.0563 0.49262 0.2797 0.15439
## Proportion of Variance 0.9246 0.05307 0.0171 0.00521
## Cumulative Proportion 0.9246 0.97769 0.9948 1.00000
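As a sanity check, the eigendecomposition and `prcomp` agree: the standard deviations reported by `prcomp` are the square roots of the eigenvalues of the covariance matrix, and the Proportion of Variance row in the summary is just `sdev^2 / sum(sdev^2)`. A self-contained verification:

```r
x <- iris[, -5]
eig <- eigen(cov(x))   # eigendecomposition of the covariance matrix
pc1 <- prcomp(x)       # unscaled PCA

# sdev of the PCs = square roots of the covariance eigenvalues
stopifnot(all.equal(pc1$sdev, sqrt(eig$values)))

# proportion of variance explained, matching the summary above
print(round(pc1$sdev^2 / sum(pc1$sdev^2), 4))
```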
We can access the components/eigenvectors using `$rotation` and the standard deviations of the principal components using `$sdev`.
round(pc1$rotation, 2)
## PC1 PC2 PC3 PC4
## Sepal.Length 0.36 -0.66 0.58 0.32
## Sepal.Width -0.08 -0.73 -0.60 -0.32
## Petal.Length 0.86 0.17 -0.08 -0.48
## Petal.Width 0.36 0.08 -0.55 0.75
round(pc1$sdev, 2)
## [1] 2.06 0.49 0.28 0.15
It doesn't really make sense to interpret PC3 and PC4: they carry very little of the variance, and it may not even be possible to give them a meaningful interpretation.
Remember that the principal components are linear combinations of the original variables. We can calculate the new data matrix `y` by multiplying the original data matrix `x` by the rotation matrix, whose columns are the eigenvectors. The `predict` function does this, but centers the original data first. Let's do it manually.
head(x)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
# center the data
means <- colMeans(x)
print(means)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
x_centered <- sweep(x, 2, means, "-") # subtract each column mean from every row
head(x_centered)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 -0.7433333 0.44266667 -2.358 -0.9993333
## 2 -0.9433333 -0.05733333 -2.358 -0.9993333
## 3 -1.1433333 0.14266667 -2.458 -0.9993333
## 4 -1.2433333 0.04266667 -2.258 -0.9993333
## 5 -0.8433333 0.54266667 -2.358 -0.9993333
## 6 -0.4433333 0.84266667 -2.058 -0.7993333
print(round(colMeans(x_centered), 3))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0 0 0 0
y <- as.matrix(x_centered) %*% eig$vectors
head(y)
## [,1] [,2] [,3] [,4]
## [1,] -2.684126 -0.3193972 0.02791483 0.002262437
## [2,] -2.714142 0.1770012 0.21046427 0.099026550
## [3,] -2.888991 0.1449494 -0.01790026 0.019968390
## [4,] -2.745343 0.3182990 -0.03155937 -0.075575817
## [5,] -2.728717 -0.3267545 -0.09007924 -0.061258593
## [6,] -2.280860 -0.7413304 -0.16867766 -0.024200858
The `predict` function does the same thing. The result is also automatically stored in `pc1$x` by default when `prcomp` is called.
yhat <- predict(pc1)
head(yhat)
## PC1 PC2 PC3 PC4
## [1,] -2.684126 -0.3193972 0.02791483 0.002262437
## [2,] -2.714142 0.1770012 0.21046427 0.099026550
## [3,] -2.888991 0.1449494 -0.01790026 0.019968390
## [4,] -2.745343 0.3182990 -0.03155937 -0.075575817
## [5,] -2.728717 -0.3267545 -0.09007924 -0.061258593
## [6,] -2.280860 -0.7413304 -0.16867766 -0.024200858
head(pc1$x)
## PC1 PC2 PC3 PC4
## [1,] -2.684126 -0.3193972 0.02791483 0.002262437
## [2,] -2.714142 0.1770012 0.21046427 0.099026550
## [3,] -2.888991 0.1449494 -0.01790026 0.019968390
## [4,] -2.745343 0.3182990 -0.03155937 -0.075575817
## [5,] -2.728717 -0.3267545 -0.09007924 -0.061258593
## [6,] -2.280860 -0.7413304 -0.16867766 -0.024200858
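We can confirm numerically that the manual projection, `predict`, and `pc1$x` all give identical scores; a self-contained sketch:

```r
x <- as.matrix(iris[, -5])
pc1 <- prcomp(x)

# manual route: center the columns, then multiply by the rotation matrix
x_centered <- scale(x, center = TRUE, scale = FALSE)
y <- x_centered %*% pc1$rotation

# all three routes agree up to numerical tolerance
stopifnot(all.equal(y, predict(pc1), check.attributes = FALSE))
stopifnot(all.equal(predict(pc1), pc1$x, check.attributes = FALSE))
```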
We can change the default colours by passing a new set of colours to the `palette` function. This is useful for choosing colour-blind friendly palettes.
library(khroma)
mypal <- colour("okabeito")(8)
mypal <- mypal[c(2:8, 1)]
names(mypal) <- NULL
palette(mypal)
After running `par(mfrow = c(a, b))`, any plots we construct will be arranged by rows in an array of `a` rows and `b` columns. `mfcol` arranges plots by column instead. The `par()` function sets these and other global graphical parameters.
abbrev <- c("SL", "SW", "PL", "PW")
par(mfrow = c(2, 2))
for (i in 1:ncol(pc1$rotation)) {
barplot(pc1$rotation[, i],
names.arg = abbrev, ylim = c(-1, 1), col = as.factor(abbrev),
main = paste("PC", i, sep = "")
)
}
Where possible, we can use symbols in addition to colours to distinguish between the classes. This is done with the `pch` argument.
pairs(iris[, 1:4],
col = iris[, 5], pch = as.integer(iris[, 5]),
main = "Pairs Plot for the Iris Dataset"
)
pairs(pc1$x,
col = iris[, 5], pch = as.integer(iris[, 5]),
main = "Pairs Plot for the Principal Components of Iris"
)
We can use a box-and-whisker plot to see how well the leading PC separates the classes. The box is drawn from the first quartile \(Q_1\) to the third quartile \(Q_3\), and a horizontal line through the box marks the median. The whiskers extend to the most extreme data points within \(Q_1 - 1.5\,\mathrm{IQR}\) and \(Q_3 + 1.5\,\mathrm{IQR}\), where \(\mathrm{IQR} = Q_3 - Q_1\) is the interquartile range. Points outside this range are considered outliers and are plotted as individual points.
boxplot(yhat[, 1] ~ iris[, 5])
We can see that Setosa is well separated from the other two species, but there is some overlap between Versicolor and Virginica. We can also see this from the pairs plot.
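The whisker fences described above can be computed directly. (A small sketch; note that `boxplot` itself computes hinges via `fivenum`, which can differ slightly from `quantile` on small samples.)

```r
pc1_scores <- prcomp(iris[, -5])$x[, 1]   # scores on the leading PC
q <- quantile(pc1_scores, c(0.25, 0.75))  # Q1 and Q3
iqr <- unname(q[2] - q[1])                # interquartile range
fences <- c(lower = unname(q[1]) - 1.5 * iqr,
            upper = unname(q[2]) + 1.5 * iqr)
print(fences)
# points outside the fences are the ones drawn individually
sum(pc1_scores < fences["lower"] | pc1_scores > fences["upper"])
```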
plot(yhat[, 1], yhat[, 2],
type = "n", xlab = "PC1", ylab = "PC2",
main = "Scatter Plot of the First Two Principal Components from Iris"
)
text(yhat[, 1], yhat[, 2], labels = substr(iris[, 5], 1, 2), col = as.integer(iris[, 5]))
Recall \(\mathrm{Var}[aX] = a^2\,\mathrm{Var}[X]\).
Units matter! If one variable is measured in cm and another in mm, their variances are not comparable: changing a variable from cm to mm multiplies its values by 10, but its variance by 100.
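A one-line numerical illustration using the sepal lengths:

```r
sl_cm <- iris$Sepal.Length   # measured in cm
sl_mm <- 10 * sl_cm          # the same lengths in mm
# multiplying the values by 10 multiplies the variance by 100
stopifnot(all.equal(var(sl_mm), 100 * var(sl_cm)))
```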
How do we make the data unitless? Standardize each variable:
\[z_{ij} = (x_{ij} - \mu_j)/\sigma_j\]
The covariance matrix of the standardized data is then exactly the correlation matrix of the original variables.
eigen(cor(iris[, 1:4]))
## eigen() decomposition
## $values
## [1] 2.91849782 0.91403047 0.14675688 0.02071484
##
## $vectors
## [,1] [,2] [,3] [,4]
## [1,] 0.5210659 -0.37741762 0.7195664 0.2612863
## [2,] -0.2693474 -0.92329566 -0.2443818 -0.1235096
## [3,] 0.5804131 -0.02449161 -0.1421264 -0.8014492
## [4,] 0.5648565 -0.06694199 -0.6342727 0.5235971
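The covariance matrix of the standardized variables equals the correlation matrix of the originals, which we can check directly:

```r
x <- iris[, -5]
z <- scale(x)   # center each column and divide by its standard deviation
stopifnot(all.equal(cov(z), cor(x), check.attributes = FALSE))
```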
We can do this with `prcomp` by setting the `scale` argument to `TRUE`.
pc2 <- prcomp(x, scale = TRUE)
summary(pc2)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
print(pc2)
## Standard deviations (1, .., p=4):
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
## Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
The interpretation is very similar, but the coefficients are more balanced.
The scree plot shows the proportion of variance explained by each component. It is useful for deciding how many components to keep.
plot(
pc2$sdev^2 / sum(pc2$sdev^2),
type = "b", main = "PCA of the scaled Iris Dataset",
xlab = "Component", ylab = "Proportion of Variance Explained"
)
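A common rule of thumb is to keep the smallest number of components whose cumulative proportion of variance passes some threshold (90% here, chosen purely for illustration):

```r
pc2 <- prcomp(iris[, -5], scale = TRUE)
prop_var <- pc2$sdev^2 / sum(pc2$sdev^2)
cum_var <- cumsum(prop_var)
print(round(cum_var, 4))
# smallest k with at least 90% of the variance explained
k <- which(cum_var >= 0.90)[1]
print(k)   # 2 components suffice for the scaled iris data
```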
The `GGally` package contains a function called `ggpairs` which produces a richer version of the standard `pairs` plot. The lower triangle is the same as a basic pairs plot, the diagonal displays marginal density plots of every variable for each group, and the upper triangle provides the overall and group-specific correlations for each pair of variables.
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs_iris <- ggpairs(
data = iris, mapping = aes(colour = Species, shape = Species), columns = 1:4,
title = "Pairs Plot for the Iris Dataset"
)
for (i in 1:4) {
for (j in 1:4) {
ggpairs_iris[i, j] <- (ggpairs_iris[i, j] + scale_colour_manual(values = mypal)
+ scale_fill_manual(values = mypal)
)
}
}
ggpairs_iris
A biplot is a scatterplot where each data point is represented by its scores along the first two principal components. The original variables are overlaid as arrows: the direction of each arrow shows how that variable loads on the two displayed components, and its length reflects how strongly the variable contributes to them.
biplot(pc2)