Logistic Regression

Logistic regression is a technique for modelling a binary (two-class) response, and the fitted model can then be used as a classifier.
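It models the probability \(p\) of one of the two outcomes through the logit link, \( \log\{p/(1-p)\} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k \), so the linear predictor ranges over the whole real line while \(p\) stays within \((0, 1)\).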

pulse <- read.table("datasets/pulse.txt", header = TRUE, stringsAsFactors = TRUE)
head(pulse)
##   RestingPulse Smokes Weight
## 1          Low     No    140
## 2          Low     No    145
## 3          Low    Yes    160
## 4          Low    Yes    190
## 5          Low     No    155
## 6          Low     No    165

Perform logistic regression on the data.

lreg <- glm(RestingPulse ~ Smokes + Weight, data = pulse, family = binomial(logit))
summary(lreg)
## 
## Call:
## glm(formula = RestingPulse ~ Smokes + Weight, family = binomial(logit), 
##     data = pulse)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -1.98717    1.67931  -1.183   0.2367  
## SmokesYes   -1.19297    0.55298  -2.157   0.0310 *
## Weight       0.02502    0.01226   2.042   0.0412 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 101.21  on 91  degrees of freedom
## Residual deviance:  93.64  on 89  degrees of freedom
## AIC: 99.64
## 
## Number of Fisher Scoring iterations: 4

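From the coefficient table, the fitted model is \( \log\{p/(1-p)\} = -1.987 - 1.193 \cdot \mathrm{SmokesYes} + 0.025 \cdot \mathrm{Weight} \), where \(p\) is the probability of a Low resting pulse.
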
In the above code, the first argument specifies a linear model for predicting RestingPulse from the Smokes and Weight values. The second argument specifies that the pulse data are to be analysed, and the final argument specifies that the linear model is applied to the logit (log-odds) of the probability for RestingPulse. Remember that by default R orders factor levels alphabetically, so that level High is set to the baseline value (i.e., equal to zero) and level Low is set to the second level (i.e., equal to one); the model therefore estimates the probability of a Low resting pulse.
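
We can check the level ordering directly, and relevel() can be used to change the reference level if the other outcome should be modelled instead:

levels(pulse$RestingPulse)
## [1] "High" "Low"
# To model P(RestingPulse == "High"), make "Low" the baseline instead:
# pulse$RestingPulse <- relevel(pulse$RestingPulse, ref = "Low")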

Including interactions

The interaction between smoking and weight can be included within the model by entering the following:

lreg_i <- glm(RestingPulse ~ Smokes * Weight,
  data = pulse, family = binomial(logit)
)
summary(lreg_i)
## 
## Call:
## glm(formula = RestingPulse ~ Smokes * Weight, family = binomial(logit), 
##     data = pulse)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)      -1.9769778  2.1876955  -0.904    0.366
## SmokesYes        -1.2187425  3.5901342  -0.339    0.734
## Weight            0.0249468  0.0160917   1.550    0.121
## SmokesYes:Weight  0.0001804  0.0248266   0.007    0.994
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 101.21  on 91  degrees of freedom
## Residual deviance:  93.64  on 88  degrees of freedom
## AIC: 101.64
## 
## Number of Fisher Scoring iterations: 4

Analysing birth data

Now load the MASS package and the data set birthwt; a help file for this data set is available via ?birthwt. A few changes need to be made to these data before we can apply a logistic regression: we omit the actual birth weight (bwt, the final column, from which the response low is derived) and ensure that certain variables are treated as categorical (factor), rather than numerical:

library("MASS")
data(birthwt)
head(birthwt)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622
newbirthdata <- birthwt[, -10]
is.factor(newbirthdata$race)
## [1] FALSE
newbirthdata$race <- as.factor(newbirthdata$race)
is.factor(newbirthdata$race)
## [1] TRUE
ptl_sel <- newbirthdata$ptl > 1
newbirthdata$ptl[ptl_sel] <- 1
newbirthdata$ptl <- as.factor(newbirthdata$ptl)

ftv_sel <- newbirthdata$ftv > 2
newbirthdata$ftv[ftv_sel] <- 2
newbirthdata$ftv <- as.factor(newbirthdata$ftv)

head(newbirthdata)
##    low age lwt race smoke ptl ht ui ftv
## 85   0  19 182    2     0   0  0  1   0
## 86   0  33 155    3     0   0  0  0   2
## 87   0  20 105    1     1   0  0  0   1
## 88   0  21 108    1     1   0  0  1   2
## 89   0  18 107    1     1   0  0  1   0
## 91   0  21 124    3     0   0  0  0   0

This code ensures that the variables race, ptl and ftv are treated as categorical (factors) rather than numeric. To keep the model results consistent with those presented in lectures, we have also recoded ptl so that it indicates whether or not there was at least one previous premature labour (rather than how many), and collapsed all ftv values (denoting physician visits) that were greater than or equal to 2 into the same factor level.
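
As a quick sanity check (counts not reproduced here), we can tabulate the recoded factors:

table(newbirthdata$ptl) # should contain only levels 0 and 1
table(newbirthdata$ftv) # should contain only levels 0, 1 and 2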

Performing logistic regression

birth_lreg1 <- glm(low ~ ., data = newbirthdata, family = binomial(logit))
summary(birth_lreg1)
## 
## Call:
## glm(formula = low ~ ., family = binomial(logit), data = newbirthdata)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.82302    1.24471   0.661  0.50848   
## age         -0.03723    0.03870  -0.962  0.33602   
## lwt         -0.01565    0.00708  -2.211  0.02705 * 
## race2        1.19241    0.53597   2.225  0.02609 * 
## race3        0.74069    0.46174   1.604  0.10869   
## smoke        0.75553    0.42502   1.778  0.07546 . 
## ptl1         1.34376    0.48062   2.796  0.00518 **
## ht           1.91317    0.72074   2.654  0.00794 **
## ui           0.68019    0.46434   1.465  0.14296   
## ftv1        -0.43638    0.47939  -0.910  0.36268   
## ftv2         0.17901    0.45638   0.392  0.69488   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 195.48  on 178  degrees of freedom
## AIC: 217.48
## 
## Number of Fisher Scoring iterations: 4

In the above call, the expression low ~ . is shorthand for a linear model featuring all other variables within the data set as predictors.
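
Written out in full, this call is equivalent to:

birth_lreg1 <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
  data = newbirthdata, family = binomial(logit)
)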

We can see that lwt, race2, ptl1 and ht are all statistically significant at the 5% level.

Including interactions

birth_lreg2 <- glm(low ~ . * ., data = newbirthdata, family = binomial(logit))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(birth_lreg2)
## 
## Call:
## glm(formula = low ~ . * ., family = binomial(logit), data = newbirthdata)
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  7.798e-01  9.492e+00   0.082  0.93453   
## age          5.396e-02  3.962e-01   0.136  0.89167   
## lwt         -2.840e-02  6.896e-02  -0.412  0.68052   
## race2        1.486e+01  1.226e+01   1.213  0.22518   
## race3        4.014e+00  5.205e+00   0.771  0.44059   
## smoke       -8.298e+00  5.179e+00  -1.602  0.10908   
## ptl1         1.149e+01  6.570e+00   1.749  0.08034 . 
## ht          -5.535e+02  4.362e+04  -0.013  0.98988   
## ui          -1.034e+00  7.439e+00  -0.139  0.88949   
## ftv1        -4.484e+00  5.568e+00  -0.805  0.42066   
## ftv2         1.395e+01  5.604e+00   2.490  0.01277 * 
## age:lwt     -4.952e-05  2.673e-03  -0.019  0.98522   
## age:race2   -5.197e-01  5.098e-01  -1.019  0.30804   
## age:race3    1.125e-02  1.550e-01   0.073  0.94213   
## age:smoke    1.844e-01  1.630e-01   1.131  0.25807   
## age:ptl1    -3.959e-01  2.209e-01  -1.792  0.07307 . 
## age:ht      -4.271e+01  3.476e+03  -0.012  0.99020   
## age:ui       1.323e-01  2.056e-01   0.644  0.51981   
## age:ftv1    -1.738e-01  1.686e-01  -1.031  0.30241   
## age:ftv2    -4.398e-01  1.681e-01  -2.616  0.00889 **
## lwt:race2   -7.666e-02  5.801e-02  -1.321  0.18639   
## lwt:race3   -3.230e-02  3.550e-02  -0.910  0.36296   
## lwt:smoke    4.140e-02  2.949e-02   1.404  0.16033   
## lwt:ptl1    -1.387e-02  4.128e-02  -0.336  0.73677   
## lwt:ht       3.246e+00  2.585e+02   0.013  0.98998   
## lwt:ui      -5.014e-03  4.567e-02  -0.110  0.91259   
## lwt:ftv1     5.001e-02  3.559e-02   1.405  0.16000   
## lwt:ftv2    -2.341e-02  3.179e-02  -0.736  0.46152   
## race2:smoke  6.130e+00  4.896e+00   1.252  0.21054   
## race3:smoke -1.069e+00  1.673e+00  -0.639  0.52288   
## race2:ptl1   3.313e+00  4.999e+00   0.663  0.50754   
## race3:ptl1  -3.429e-01  1.861e+00  -0.184  0.85382   
## race2:ht     8.432e+02  6.583e+04   0.013  0.98978   
## race3:ht     1.168e+03  9.234e+04   0.013  0.98991   
## race2:ui     1.153e+01  7.769e+00   1.484  0.13784   
## race3:ui     1.216e+00  1.492e+00   0.815  0.41504   
## race2:ftv1   4.431e+00  3.542e+00   1.251  0.21094   
## race3:ftv1   1.796e+00  1.689e+00   1.064  0.28752   
## race2:ftv2   4.288e+00  3.998e+00   1.073  0.28348   
## race3:ftv2  -2.138e+00  1.906e+00  -1.121  0.26211   
## smoke:ptl1   1.441e+00  2.143e+00   0.673  0.50125   
## smoke:ht     5.826e+02  4.706e+04   0.012  0.99012   
## smoke:ui    -1.785e-01  1.770e+00  -0.101  0.91969   
## smoke:ftv1   9.400e-01  1.760e+00   0.534  0.59333   
## smoke:ftv2  -1.521e+00  1.601e+00  -0.950  0.34214   
## ptl1:ht      1.328e+02  1.732e+04   0.008  0.99388   
## ptl1:ui     -2.083e+00  2.005e+00  -1.039  0.29882   
## ptl1:ftv1    2.879e+00  1.946e+00   1.480  0.13888   
## ptl1:ftv2    3.506e-01  3.018e+00   0.116  0.90752   
## ht:ui               NA         NA      NA       NA   
## ht:ftv1      3.539e+02  5.064e+04   0.007  0.99442   
## ht:ftv2      4.680e+02  4.609e+04   0.010  0.99190   
## ui:ftv1     -1.426e+00  2.206e+00  -0.647  0.51784   
## ui:ftv2     -5.598e-01  1.955e+00  -0.286  0.77461   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 142.40  on 136  degrees of freedom
## AIC: 248.4
## 
## Number of Fisher Scoring iterations: 20

We can see that the model now contains a large number of non-significant coefficients and fewer significant ones; the warning that fitted probabilities of 0 or 1 occurred also signals numerical problems. Including so many interactions is clearly a bad idea here.
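
A formal comparison of the two nested models can be made with a deviance test (described in the Model Comparison section below; output omitted here):

# Likelihood-ratio (deviance) test: main effects vs. all pairwise interactions
anova(birth_lreg1, birth_lreg2, test = "Chisq")

Instead, we can try including only the most significant interaction, age:ftv.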

birth_lreg3 <- glm(low ~ . + age * ftv, data = newbirthdata, family = binomial(logit))
summary(birth_lreg3)
## 
## Call:
## glm(formula = low ~ . + age * ftv, family = binomial(logit), 
##     data = newbirthdata)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -1.636488   1.559419  -1.049  0.29398   
## age          0.085461   0.055751   1.533  0.12530   
## lwt         -0.017599   0.007659  -2.298  0.02157 * 
## race2        0.994137   0.551155   1.804  0.07127 . 
## race3        0.700673   0.491778   1.425  0.15422   
## smoke        0.792971   0.452645   1.752  0.07980 . 
## ptl1         1.373395   0.495983   2.769  0.00562 **
## ht           1.936209   0.747826   2.589  0.00962 **
## ui           0.938626   0.492497   1.906  0.05667 . 
## ftv1         2.877897   2.254829   1.276  0.20184   
## ftv2         8.265160   2.601755   3.177  0.00149 **
## age:ftv1    -0.149619   0.096401  -1.552  0.12065   
## age:ftv2    -0.359464   0.115825  -3.104  0.00191 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.67  on 188  degrees of freedom
## Residual deviance: 183.00  on 176  degrees of freedom
## AIC: 209
## 
## Number of Fisher Scoring iterations: 5

Model Comparison

Deviance

The deviance information is best used when comparing a nested sequence of models. We can then use the deviance to decide if the extra parameters used in the more complicated models are justified by the data.

Simply put, the deviance measures how far the fitted model is from the best possible model for these data (the saturated model, which fits every observation perfectly), so a smaller deviance is better. The null deviance is the deviance of the simplest model, which has just one parameter (an intercept only).
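Formally, \( D = 2\{\ell_{\text{saturated}} - \ell_{\text{model}}\} \), where \(\ell\) denotes the maximised log-likelihood.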

lreg$deviance gives us the deviance of the model lreg, and lreg$df.residual gives us the number of degrees of freedom of the appropriate chi-squared distribution. The value 1 - pchisq(deviance, deg) is the \(p\)-value of the goodness-of-fit test with \(H_0\): the proposed model is true. A very small value (< 5%) therefore suggests that the model probably isn't adequate.

lreg$deviance
## [1] 93.63964
lreg$df.residual
## [1] 89
1 - pchisq(lreg$deviance, lreg$df.residual)
## [1] 0.3476501

We see that there is not sufficient statistical evidence to reject the model (\(p \approx 0.35\)).

We can also compare nested models. For example, consider the pulse data models lreg and lreg_i, which excluded and included the interaction term, respectively. Comparing the deviances directly, we see that the difference is tiny. A more formal analysis, referring the deviance difference to a chi-squared distribution with degrees of freedom equal to the difference in residual degrees of freedom, confirms that there is no clear benefit to including the interaction term, as a non-significant \(p\)-value is returned.

lreg$deviance - lreg_i$deviance
## [1] 5.27744e-05
1 - pchisq(lreg$deviance - lreg_i$deviance, lreg$df.residual - lreg_i$df.residual)
## [1] 0.9942037
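
The same comparison can be carried out in one step with anova(), which reproduces the deviance difference and \(p\)-value above (output omitted):

anova(lreg, lreg_i, test = "Chisq")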

Akaike’s information criterion (AIC)

The AIC trades off fit against complexity (for the binary-response models here it equals the residual deviance plus twice the number of parameters); the model with the lowest AIC is the best compromise between the two.

AIC(birth_lreg1)
## [1] 217.4755
AIC(birth_lreg2)
## [1] 248.4039
AIC(birth_lreg3)
## [1] 209.0006

We can see that the third model, birth_lreg3, is the one favoured by AIC.
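
AIC-based selection can also be automated; for example, stepAIC() from the already-loaded MASS package performs a stepwise search over candidate models (a sketch, output omitted):

# Stepwise selection by AIC, starting from the main-effects model
birth_step <- stepAIC(birth_lreg1, trace = FALSE)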

Prediction

The help file for the predict function for logistic regression can be accessed with ?predict.glm.

head(predict(lreg, pulse[, c(2, 3)]))
##         1         2         3         4         5         6 
## 1.5159947 1.6411077 0.8234727 1.5741509 1.8913338 2.1415598

The outputs are neither 0/1-valued nor even probabilities. In fact, the results are given on the logit scale (i.e., the real number line), which is the scale of the linear predictor in the model. To instead return output on the probability scale, we can enter the following:

head(predict(lreg, pulse[, c(2, 3)], type = "response"))
##         1         2         3         4         5         6 
## 0.8199479 0.8376856 0.6949730 0.8283745 0.8689075 0.8948774

Let's try to reproduce one of these predictions with a direct calculation. First create a data point:

test_point <- data.frame(Smokes = "No", Weight = 140)
predict(lreg, test_point, type = "response")
##         1 
## 0.8199479

Now compute the prediction manually using the fitted coefficients and the inverse logit \( p = e^{\eta}/(1 + e^{\eta}) \), where \(\eta\) denotes the linear predictor:

lreg$coefficients
## (Intercept)   SmokesYes      Weight 
## -1.98717006 -1.19297408  0.02502261
logit <- lreg$coefficients[1] + lreg$coefficients[2] * 0 + lreg$coefficients[3] * 140
exp(logit) / (1 + exp(logit))
## (Intercept) 
##   0.8199479
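
Equivalently, R's built-in logistic distribution function plogis() computes the inverse logit directly:

plogis(logit) # exp(logit) / (1 + exp(logit)), again giving 0.8199479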

Classification

To use the logistic regression fit as a classifier, we can classify each data point as Low whenever its predicted probability exceeds some threshold. We could pick 0.5 as our threshold, but notice that we have imbalanced groups: 22 High and 70 Low pulse cases. In this case, it (arguably) makes more sense to choose a threshold of \(70/92 \approx 0.76\). (This is just one of many ways to specify a threshold.)
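
The threshold is just the proportion of Low cases:

# table(pulse$RestingPulse) confirms the 22/70 split (output omitted)
70 / 92
## [1] 0.7608696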

which(as.vector(predict(lreg, pulse[, c(2, 3)], type = "response")) > 0.76)
##  [1]  1  2  4  5  6  7  8  9 10 12 14 15 16 17 19 20 21 22 23 24 25 28 32 36 37
## [26] 39 40 43 44 45 47 48 49 50 51 55 56 57 59 60 61 64 65 66 68 69 71 83 89 91
# res1 equals 1 where the model predicts Low (probability > 0.76) and 0 otherwise
res1 <- round(as.vector(predict(lreg, pulse[, c(2, 3)], type = "response")) > 0.76)
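
A confusion matrix (counts not shown here) summarises how these classifications line up with the true labels:

# Rows: predicted class (1 = Low, 0 = High); columns: true class
table(Predicted = res1, Actual = pulse$RestingPulse)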

# Fitted probability curves over the observed weight range,
# for non-smokers (res_2) and smokers (res_3)
weight_range <- seq(min(pulse[, 3]), max(pulse[, 3]), length.out = 100)
res_2 <- predict(lreg, data.frame(Smokes = rep("No", 100), Weight = weight_range), type = "response")
res_3 <- predict(lreg, data.frame(Smokes = rep("Yes", 100), Weight = weight_range), type = "response")


# Plot weight against a jittered smoking indicator (0 = No, 1 = Yes).
# Colour gives the true class (black = High, red = Low); the plotting symbol
# gives the predicted class (circle = predicted Low, square = predicted High).
plot(pulse[, 3], as.numeric(pulse[, 2]) - 1 + rnorm(length(pulse[, 2]), 0, 0.05),
  col = pulse[, 1], pch = res1, ylab = "Jittered Smoking Indicator",
  xlab = "Weight", main = "Pulse data logistic regression results"
)

# class_labels avoids masking the base function class()
class_labels <- c("Correct Low", "Correct High", "False Low", "False High")
legend("right", legend = class_labels, col = c(2, 1, 1, 2), pch = c(1, 0, 1, 0), cex = 0.8)


# Overlay the fitted probability curves and the classification threshold
lines(weight_range, res_2)
text(weight_range[1], res_2[1], "Don't Smoke", cex = 0.8, pos = 4)
lines(weight_range, res_3)
text(weight_range[1], res_3[1], "Smoke", cex = 0.8, pos = 4)
abline(h = 0.76, lty = 2, lwd = 2)