Logistic regression is a binary classification technique: it models the probability of one of two outcomes as a function of predictor variables. We begin by reading in the pulse data:
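As a reminder of the form of the model: with \(p\) denoting the probability of the second factor level, logistic regression assumes

\[
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\]

so that \(p = 1/(1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)})\).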
pulse <- read.table("datasets/pulse.txt", header = TRUE, stringsAsFactors = TRUE)
head(pulse)
## RestingPulse Smokes Weight
## 1 Low No 140
## 2 Low No 145
## 3 Low Yes 160
## 4 Low Yes 190
## 5 Low No 155
## 6 Low No 165
Perform logistic regression on the data.
lreg <- glm(RestingPulse ~ Smokes + Weight, data = pulse, family = binomial(logit))
summary(lreg)
##
## Call:
## glm(formula = RestingPulse ~ Smokes + Weight, family = binomial(logit),
## data = pulse)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.98717 1.67931 -1.183 0.2367
## SmokesYes -1.19297 0.55298 -2.157 0.0310 *
## Weight 0.02502 0.01226 2.042 0.0412 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 101.21 on 91 degrees of freedom
## Residual deviance: 93.64 on 89 degrees of freedom
## AIC: 99.64
##
## Number of Fisher Scoring iterations: 4
In the above code, the first argument specifies a linear model for predicting RestingPulse from the Smokes and Weight values. The second argument specifies that the pulse data are to be analysed. The final argument specifies that the linear model is applied to the logit of the probability for RestingPulse. Remember that by default R orders factor levels alphabetically, so that factor level High is set to the baseline value (i.e., equal to zero) and factor level Low is set to the second level (i.e., equal to one); the model therefore predicts the probability of a Low resting pulse.
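We can check the level ordering directly, and change the baseline with relevel() if we wanted to model the probability of a High resting pulse instead. A sketch, assuming the pulse data frame loaded above:

```r
levels(pulse$RestingPulse) # "High" "Low": High is the baseline level
# To make Low the baseline (so the model predicts P(High)), uncomment:
# pulse$RestingPulse <- relevel(pulse$RestingPulse, ref = "Low")
```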
The interaction between smoking and weight can be included within the model by entering the following:
lreg_i <- glm(RestingPulse ~ Smokes * Weight,
data = pulse, family = binomial(logit)
)
summary(lreg_i)
##
## Call:
## glm(formula = RestingPulse ~ Smokes * Weight, family = binomial(logit),
## data = pulse)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9769778 2.1876955 -0.904 0.366
## SmokesYes -1.2187425 3.5901342 -0.339 0.734
## Weight 0.0249468 0.0160917 1.550 0.121
## SmokesYes:Weight 0.0001804 0.0248266 0.007 0.994
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 101.21 on 91 degrees of freedom
## Residual deviance: 93.64 on 88 degrees of freedom
## AIC: 101.64
##
## Number of Fisher Scoring iterations: 4
Now load the MASS package and the birthwt data set. A help file for this data set can be viewed with ?birthwt. A few changes need to be made to these data before we can apply a logistic regression: we need to omit the actual birth weight (the bwt column) and ensure that certain variables are treated as categorical (factor), rather than numerical:
library("MASS")
data(birthwt)
head(birthwt)
## low age lwt race smoke ptl ht ui ftv bwt
## 85 0 19 182 2 0 0 0 1 0 2523
## 86 0 33 155 3 0 0 0 0 3 2551
## 87 0 20 105 1 1 0 0 0 1 2557
## 88 0 21 108 1 1 0 0 1 2 2594
## 89 0 18 107 1 1 0 0 1 0 2600
## 91 0 21 124 3 0 0 0 0 0 2622
newbirthdata <- birthwt[, -10]
is.factor(newbirthdata$race)
## [1] FALSE
newbirthdata$race <- as.factor(newbirthdata$race)
is.factor(newbirthdata$race)
## [1] TRUE
ptl_sel <- newbirthdata$ptl > 1
newbirthdata$ptl[ptl_sel] <- 1
newbirthdata$ptl <- as.factor(newbirthdata$ptl)
ftv_sel <- newbirthdata$ftv > 2
newbirthdata$ftv[ftv_sel] <- 2
newbirthdata$ftv <- as.factor(newbirthdata$ftv)
head(newbirthdata)
## low age lwt race smoke ptl ht ui ftv
## 85 0 19 182 2 0 0 0 1 0
## 86 0 33 155 3 0 0 0 0 2
## 87 0 20 105 1 1 0 0 0 1
## 88 0 21 108 1 1 0 0 1 2
## 89 0 18 107 1 1 0 0 1 0
## 91 0 21 124 3 0 0 0 0 0
This code ensures that the variables race, ptl and ftv are treated as categorical (factors) rather than numeric. To keep the model results consistent with those presented in lectures, we have changed the ptl values so that they indicate whether or not there was a previous premature labour (rather than how many), and collapsed all ftv values (denoting physician visits) that were greater than or equal to 2 into the same factor level.
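As an aside, the same recoding can be written more compactly with pmin(), which caps each value at a given maximum. This is an equivalent alternative to the steps above, not an additional step:

```r
# Cap ptl at 1 and ftv at 2, then convert each to a factor in one line
newbirthdata$ptl <- factor(pmin(birthwt$ptl, 1))
newbirthdata$ftv <- factor(pmin(birthwt$ftv, 2))
```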
birth_lreg1 <- glm(low ~ ., data = newbirthdata, family = binomial(logit))
summary(birth_lreg1)
##
## Call:
## glm(formula = low ~ ., family = binomial(logit), data = newbirthdata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.82302 1.24471 0.661 0.50848
## age -0.03723 0.03870 -0.962 0.33602
## lwt -0.01565 0.00708 -2.211 0.02705 *
## race2 1.19241 0.53597 2.225 0.02609 *
## race3 0.74069 0.46174 1.604 0.10869
## smoke 0.75553 0.42502 1.778 0.07546 .
## ptl1 1.34376 0.48062 2.796 0.00518 **
## ht 1.91317 0.72074 2.654 0.00794 **
## ui 0.68019 0.46434 1.465 0.14296
## ftv1 -0.43638 0.47939 -0.910 0.36268
## ftv2 0.17901 0.45638 0.392 0.69488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 195.48 on 178 degrees of freedom
## AIC: 217.48
##
## Number of Fisher Scoring iterations: 4
In the above, the expression low ~ . is shorthand for a linear model featuring all other variables within the data set as predictors. We can see that lwt, race2, ptl1 and ht are statistically significant.
birth_lreg2 <- glm(low ~ . * ., data = newbirthdata, family = binomial(logit))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(birth_lreg2)
##
## Call:
## glm(formula = low ~ . * ., family = binomial(logit), data = newbirthdata)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.798e-01 9.492e+00 0.082 0.93453
## age 5.396e-02 3.962e-01 0.136 0.89167
## lwt -2.840e-02 6.896e-02 -0.412 0.68052
## race2 1.486e+01 1.226e+01 1.213 0.22518
## race3 4.014e+00 5.205e+00 0.771 0.44059
## smoke -8.298e+00 5.179e+00 -1.602 0.10908
## ptl1 1.149e+01 6.570e+00 1.749 0.08034 .
## ht -5.535e+02 4.362e+04 -0.013 0.98988
## ui -1.034e+00 7.439e+00 -0.139 0.88949
## ftv1 -4.484e+00 5.568e+00 -0.805 0.42066
## ftv2 1.395e+01 5.604e+00 2.490 0.01277 *
## age:lwt -4.952e-05 2.673e-03 -0.019 0.98522
## age:race2 -5.197e-01 5.098e-01 -1.019 0.30804
## age:race3 1.125e-02 1.550e-01 0.073 0.94213
## age:smoke 1.844e-01 1.630e-01 1.131 0.25807
## age:ptl1 -3.959e-01 2.209e-01 -1.792 0.07307 .
## age:ht -4.271e+01 3.476e+03 -0.012 0.99020
## age:ui 1.323e-01 2.056e-01 0.644 0.51981
## age:ftv1 -1.738e-01 1.686e-01 -1.031 0.30241
## age:ftv2 -4.398e-01 1.681e-01 -2.616 0.00889 **
## lwt:race2 -7.666e-02 5.801e-02 -1.321 0.18639
## lwt:race3 -3.230e-02 3.550e-02 -0.910 0.36296
## lwt:smoke 4.140e-02 2.949e-02 1.404 0.16033
## lwt:ptl1 -1.387e-02 4.128e-02 -0.336 0.73677
## lwt:ht 3.246e+00 2.585e+02 0.013 0.98998
## lwt:ui -5.014e-03 4.567e-02 -0.110 0.91259
## lwt:ftv1 5.001e-02 3.559e-02 1.405 0.16000
## lwt:ftv2 -2.341e-02 3.179e-02 -0.736 0.46152
## race2:smoke 6.130e+00 4.896e+00 1.252 0.21054
## race3:smoke -1.069e+00 1.673e+00 -0.639 0.52288
## race2:ptl1 3.313e+00 4.999e+00 0.663 0.50754
## race3:ptl1 -3.429e-01 1.861e+00 -0.184 0.85382
## race2:ht 8.432e+02 6.583e+04 0.013 0.98978
## race3:ht 1.168e+03 9.234e+04 0.013 0.98991
## race2:ui 1.153e+01 7.769e+00 1.484 0.13784
## race3:ui 1.216e+00 1.492e+00 0.815 0.41504
## race2:ftv1 4.431e+00 3.542e+00 1.251 0.21094
## race3:ftv1 1.796e+00 1.689e+00 1.064 0.28752
## race2:ftv2 4.288e+00 3.998e+00 1.073 0.28348
## race3:ftv2 -2.138e+00 1.906e+00 -1.121 0.26211
## smoke:ptl1 1.441e+00 2.143e+00 0.673 0.50125
## smoke:ht 5.826e+02 4.706e+04 0.012 0.99012
## smoke:ui -1.785e-01 1.770e+00 -0.101 0.91969
## smoke:ftv1 9.400e-01 1.760e+00 0.534 0.59333
## smoke:ftv2 -1.521e+00 1.601e+00 -0.950 0.34214
## ptl1:ht 1.328e+02 1.732e+04 0.008 0.99388
## ptl1:ui -2.083e+00 2.005e+00 -1.039 0.29882
## ptl1:ftv1 2.879e+00 1.946e+00 1.480 0.13888
## ptl1:ftv2 3.506e-01 3.018e+00 0.116 0.90752
## ht:ui NA NA NA NA
## ht:ftv1 3.539e+02 5.064e+04 0.007 0.99442
## ht:ftv2 4.680e+02 4.609e+04 0.010 0.99190
## ui:ftv1 -1.426e+00 2.206e+00 -0.647 0.51784
## ui:ftv2 -5.598e-01 1.955e+00 -0.286 0.77461
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 142.40 on 136 degrees of freedom
## AIC: 248.4
##
## Number of Fisher Scoring iterations: 20
We can see that the model now contains a large number of non-significant coefficients and fewer significant ones, so including every pairwise interaction is clearly a bad idea here. (The warning that fitted probabilities numerically 0 or 1 occurred is another symptom: some combinations of predictors separate the two outcome groups perfectly, which makes the affected coefficient estimates unreliable.) But we can try including only the most significant interaction, age:ftv.
birth_lreg3 <- glm(low ~ . + age * ftv, data = newbirthdata, family = binomial(logit))
summary(birth_lreg3)
##
## Call:
## glm(formula = low ~ . + age * ftv, family = binomial(logit),
## data = newbirthdata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.636488 1.559419 -1.049 0.29398
## age 0.085461 0.055751 1.533 0.12530
## lwt -0.017599 0.007659 -2.298 0.02157 *
## race2 0.994137 0.551155 1.804 0.07127 .
## race3 0.700673 0.491778 1.425 0.15422
## smoke 0.792971 0.452645 1.752 0.07980 .
## ptl1 1.373395 0.495983 2.769 0.00562 **
## ht 1.936209 0.747826 2.589 0.00962 **
## ui 0.938626 0.492497 1.906 0.05667 .
## ftv1 2.877897 2.254829 1.276 0.20184
## ftv2 8.265160 2.601755 3.177 0.00149 **
## age:ftv1 -0.149619 0.096401 -1.552 0.12065
## age:ftv2 -0.359464 0.115825 -3.104 0.00191 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 234.67 on 188 degrees of freedom
## Residual deviance: 183.00 on 176 degrees of freedom
## AIC: 209
##
## Number of Fisher Scoring iterations: 5
The deviance is best used when comparing a nested sequence of models: it lets us decide whether the extra parameters used in the more complicated models are justified by the data. Simply put, the deviance measures how far a model falls short of the best possible (saturated) model, so smaller deviance is better. The null deviance is the deviance of the simplest model, which has just one parameter (the intercept). lreg$deviance gives us the deviance of the model lreg, and lreg$df.residual gives us the number of degrees of freedom of the appropriate chi-squared distribution. The value 1 - pchisq(deviance, deg) is the \(p\)-value for the test with \(H_0\): the proposed model is adequate. So a very small value (below 5%, say) means that the model probably does not describe the data well.
lreg$deviance
## [1] 93.63964
lreg$df.residual
## [1] 89
1 - pchisq(lreg$deviance, lreg$df.residual)
## [1] 0.3476501
We see that there isn’t sufficient statistical evidence to say that the model isn’t true.
We can also compare between models. For example, consider the pulse data models lreg and lreg_i, which exclude and include the interaction term, respectively. Comparing the deviances directly, we see that the difference is very small. A more formal analysis confirms that there is no clear additional benefit to including the interaction term, as a non-significant \(p\)-value is returned.
lreg$deviance - lreg_i$deviance
## [1] 5.27744e-05
1 - pchisq(lreg$deviance - lreg_i$deviance, lreg$df.residual - lreg_i$df.residual)
## [1] 0.9942037
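An equivalent, built-in way to carry out this nested-model comparison is anova() with a chi-squared test. A sketch, assuming the fitted objects lreg and lreg_i from above are in the workspace:

```r
# Likelihood-ratio (deviance) test between the nested models;
# the reported p-value matches the manual pchisq calculation
anova(lreg, lreg_i, test = "Chisq")
```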
The model with the lowest AIC is the best compromise between fit and complexity.
AIC(birth_lreg1)
## [1] 217.4755
AIC(birth_lreg2)
## [1] 248.4039
AIC(birth_lreg3)
## [1] 209.0006
We can see that the third model is the most favoured one.
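Rather than comparing hand-picked models, R's step() function automates this search, adding or dropping terms to minimise the AIC. A sketch, assuming the fitted model birth_lreg1 from above:

```r
# Stepwise model selection by AIC (trace = 0 suppresses per-step output)
birth_step <- step(birth_lreg1, trace = 0)
summary(birth_step)
```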
The help file for the predict function for logistic regression can be
accessed with ?predict.glm
.
head(predict(lreg, pulse[, c(2, 3)]))
## 1 2 3 4 5 6
## 1.5159947 1.6411077 0.8234727 1.5741509 1.8913338 2.1415598
The outputs are neither 0 nor 1 valued, nor are they even probabilities. In fact the results are given on the logit scale (i.e., the real number line), which is the scale of the linear predictor. To instead return output on the probability scale we can enter the following:
head(predict(lreg, pulse[, c(2, 3)], type = "response"))
## 1 2 3 4 5 6
## 0.8199479 0.8376856 0.6949730 0.8283745 0.8689075 0.8948774
Let's try to reproduce the prediction with a direct calculation. First create a data point:
test_point <- data.frame(Smokes = "No", Weight = 140)
predict(lreg, test_point, type = "response")
## 1
## 0.8199479
Now compute the same probability manually using the fitted coefficients:
lreg$coefficients
## (Intercept) SmokesYes Weight
## -1.98717006 -1.19297408 0.02502261
logit <- lreg$coefficients[1] + lreg$coefficients[2] * 0 + lreg$coefficients[3] * 140
exp(logit) / (1 + exp(logit))
## (Intercept)
## 0.8199479
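R's built-in plogis() computes the same inverse-logit transformation, which avoids writing the exponential out by hand. A self-contained check using the coefficient values printed above:

```r
# Inverse logit via plogis(): non-smoker (SmokesYes = 0) of weight 140
logit <- -1.98717006 + (-1.19297408) * 0 + 0.02502261 * 140
plogis(logit) # matches exp(logit) / (1 + exp(logit)), approx. 0.8199
```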
To use the logistic regression fit as a classifier, we can select the data points which have a probability greater than some threshold. We could pick 0.5 here as our threshold, but notice that we have imbalanced groups, 22 high and 70 low pulse cases. In this case, it (arguably) makes more sense to choose a threshold of \(70/92=0.76\). (This is just one way of many to specify a threshold.)
which(as.vector(predict(lreg, pulse[, c(2, 3)], type = "response")) > 0.76)
## [1] 1 2 4 5 6 7 8 9 10 12 14 15 16 17 19 20 21 22 23 24 25 28 32 36 37
## [26] 39 40 43 44 45 47 48 49 50 51 55 56 57 59 60 61 64 65 66 68 69 71 83 89 91
res1 <- round(as.vector(predict(lreg, pulse[, c(2, 3)], type = "response")) > 0.76)
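To judge how well the classifier does at this threshold, we can cross-tabulate the predictions against the observed labels. A sketch, assuming res1 and the pulse data from above:

```r
# Rows: predicted class (0 = High, 1 = Low); columns: observed RestingPulse
table(Predicted = res1, Actual = pulse$RestingPulse)
```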
weight_range <- seq(min(pulse[, 3]), max(pulse[, 3]), length.out = 100)
res_2 <- predict(lreg, data.frame(Smokes = rep("No", 100), Weight = weight_range), type = "response")
res_3 <- predict(lreg, data.frame(Smokes = rep("Yes", 100), Weight = weight_range), type = "response")
plot(pulse[, 3], as.numeric(pulse[, 2]) - 1 + rnorm(length(pulse[, 2]), 0, 0.05),
  col = pulse[, 1], pch = res1, ylab = "Jittered Smoking Indicator",
  xlab = "Weight", main = "Pulse data logistic regression results"
)
class <- c("Correct Low", "Correct High", "False Low", "False High")
legend("right", legend = class, col = c(2, 1, 1, 2), pch = c(1, 0, 1, 0), cex = 0.8)
lines(weight_range, res_2)
text(weight_range[1], res_2[1], "Don't Smoke", cex = 0.8, pos = 4)
lines(weight_range, res_3)
text(weight_range[1], res_3[1], "Smoke", cex = 0.8, pos = 4)
abline(h = 0.76, lty = 2, lwd = 2)