We will use the prostate cancer data set from the book Hastie et al. (2009), which examines how the PSA score (the response variable) of prostate cancer patients relates to a number of possible factors.
load("prostate.Rdata")
names(prostate)
## [1] "X" "lcavol" "lweight" "age" "lbph" "svi" "lcp"
## [8] "gleason" "pgg45" "lpsa"
The variable descriptions are as follows: lpsa: log PSA score lcavol: log cancer volume lweight: log prostate weight age: age of patient lbph: log of the amount of benign prostatic hyperplasia svi: seminal vesicle invasion lcp: log of capsular penetration gleason: Gleason score pgg45: percent of Gleason scores 4 or 5
We treat svi and gleason as categorical variables with 2 and 4 levels, respectively.
prostate$svi=as.factor(prostate$svi)
prostate$gleason=as.factor(prostate$gleason)
We fit a main effects model with all variables. We consider this as the full model.
lfull=lm(lpsa~lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45, data=prostate)
lfulls=summary(lfull)
lfulls
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp +
## gleason + pgg45, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77956 -0.31691 -0.04774 0.44278 1.52832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.486274 0.885929 0.549 0.58451
## lcavol 0.549532 0.090087 6.100 2.94e-08 ***
## lweight 0.623816 0.200463 3.112 0.00252 **
## age -0.023103 0.011282 -2.048 0.04364 *
## lbph 0.091523 0.058454 1.566 0.12108
## svi1 0.744537 0.244602 3.044 0.00310 **
## lcp -0.124612 0.094565 -1.318 0.19109
## gleason7 0.253874 0.216141 1.175 0.24341
## gleason8 0.480520 0.760895 0.632 0.52938
## gleason9 -0.036003 0.494866 -0.073 0.94217
## pgg45 0.004902 0.004622 1.061 0.29184
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6973 on 86 degrees of freedom
## Multiple R-squared: 0.6731, Adjusted R-squared: 0.6351
## F-statistic: 17.71 on 10 and 86 DF, p-value: < 2.2e-16
We also consider three partial models: (i) model lp1 includes the variables lcavol, lweight, pgg45; (ii) model lp2 includes lcavol, lweight, age, svi; (iii) model lp3 includes lcavol, lweight, age, svi, lbph.
lp1=lm(lpsa~lcavol+lweight+pgg45, data=prostate)
lp1s=summary(lp1)
lp2=lm(lpsa~lcavol+lweight+age+svi, data=prostate)
lp2s=summary(lp2)
lp3=lm(lpsa~lcavol+lweight+age+svi+lbph, data=prostate)
lp3s=summary(lp3)
We find the values of \(R^2\) and adjusted \(R^2\) for each model, including the full:
c(lfulls$r.squared, lfulls$adj.r.squared)
## [1] 0.6731418 0.6351350
c(lp1s$r.squared, lp1s$adj.r.squared)
## [1] 0.6096989 0.5971086
c(lp2s$r.squared, lp2s$adj.r.squared)
## [1] 0.6408203 0.6252038
c(lp3s$r.squared, lp3s$adj.r.squared)
## [1] 0.6526150 0.6335279
In our example lp1 is nested in lfull, lp2 is nested in lp3, and lp3 is nested in lfull.
Comparisons based on \(R^2\) are only suitable between models of the same size. For nested models they are not meaningful, because the larger model always has the greater value of \(R^2\).
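This property is easy to demonstrate with a small sketch. It uses the built-in mtcars data rather than the prostate set, so it is self-contained: even adding a pure-noise predictor never decreases \(R^2\).

```r
# Sketch on the built-in mtcars data (not the prostate set): adding a
# pure-noise predictor to a linear model can never decrease R-squared.
set.seed(1)
m_small = lm(mpg ~ wt, data = mtcars)
mtcars$noise = rnorm(nrow(mtcars))        # predictor unrelated to the response
m_big = lm(mpg ~ wt + noise, data = mtcars)
summary(m_small)$r.squared <= summary(m_big)$r.squared
## [1] TRUE
```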
Comparisons based on adjusted \(R^2\), which penalizes model size, are meaningful between nested models. In our example, the adjusted \(R^2\) of lp1 is noticeably lower than that of lfull, indicating that lp1 is not sufficient for fitting. A partial F-test between lp1 and lfull gives the same conclusion:
anova(lp1, lfull)
## Analysis of Variance Table
##
## Model 1: lpsa ~ lcavol + lweight + pgg45
## Model 2: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 93 49.926
## 2 86 41.811 7 8.1155 2.3846 0.02812 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis is rejected, i.e., there is significant evidence that at least one of the terms in lfull not included in lp1 has a nonzero coefficient.
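The F-statistic reported by anova() can also be reproduced by hand from the residual sums of squares of the two nested models. A minimal sketch, shown on the built-in mtcars data since the formula is the same for any pair of nested linear models:

```r
# Partial F-test by hand: F = ((RSS_p - RSS_f)/(df_p - df_f)) / (RSS_f/df_f),
# where p and f denote the partial and the full model.
partial = lm(mpg ~ wt, data = mtcars)
full    = lm(mpg ~ wt + hp + qsec, data = mtcars)
RSSp = sum(residuals(partial)^2); dfp = partial$df.residual
RSSf = sum(residuals(full)^2);    dff = full$df.residual
Fstat = ((RSSp - RSSf) / (dfp - dff)) / (RSSf / dff)
all.equal(Fstat, anova(partial, full)$F[2])    # matches the anova() table
## [1] TRUE
```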
The comparison between lp2, lp3 and lfull shows that lfull still has the largest adjusted \(R^2\) of the three; however, the differences between the models are small, and it is questionable whether the full model is meaningfully better. Indeed, the partial F-tests now do not reject the null hypothesis:
anova(lp2, lfull)
## Analysis of Variance Table
##
## Model 1: lpsa ~ lcavol + lweight + age + svi
## Model 2: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 92 45.945
## 2 86 41.811 6 4.1345 1.4174 0.2173
anova(lp3, lfull)
## Analysis of Variance Table
##
## Model 1: lpsa ~ lcavol + lweight + age + svi + lbph
## Model 2: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 91 44.437
## 2 86 41.811 5 2.6258 1.0802 0.3771
An alternative measure of comparison, applicable even among non-nested models, is AIC (the Akaike Information Criterion). It is also applicable to generalized linear models, since it is based on maximum-likelihood estimation (for linear models we know that the MLE and the LSE of the coefficients coincide). It is defined as \(AIC = 2 k - 2 \log(\hat{L})\), where \(k\) is the number of estimated parameters in the model and \(\hat{L}\) is the maximum value of the likelihood function achieved by ML estimation. Since a better fit means a larger likelihood, the term \(- 2 \log(\hat{L})\) should be small; AIC additionally penalizes the likelihood with the number of parameters of the model, in order to avoid overfitting. Smaller values of AIC indicate a better model.
We can find the AIC value for a model using the AIC function:
AIC(lfull)
## [1] 217.6434
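The definition can be verified directly: for a fitted lm object, logLik() returns the maximized log-likelihood together with the number of estimated parameters \(k\), and \(2k - 2\log(\hat{L})\) reproduces the value of AIC(). A sketch on the built-in mtcars data:

```r
# AIC by hand for a linear model, compared with the built-in AIC() function.
m = lm(mpg ~ wt, data = mtcars)
ll = logLik(m)                 # maximized log-likelihood
k = attr(ll, "df")             # number of estimated parameters (incl. sigma^2)
all.equal(2 * k - 2 * as.numeric(ll), AIC(m))
## [1] TRUE
```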
We can also call the function with many models as arguments, in which case it returns a comparison table with the AIC values for each model:
AIC(lp1, lp2, lp3, lfull)
## df AIC
## lp1 5 220.8505
## lp2 6 214.7902
## lp3 7 213.5514
## lfull 12 217.6434
Note that the degrees of freedom in the above table are not the residual degrees of freedom of the linear model: they count the parameters estimated by maximum likelihood, i.e., the regression coefficients plus one for the error variance (for example, 11 + 1 = 12 for lfull).
We see that based on AIC model lp3 is the most appropriate between the four models considered, since it shows the best combination between fit (likelihood) and complexity (number of parameters).
As a final measure of comparison, appropriate for comparing two nested models, we consider Mallows' Cp statistic. We can compute it using the ols_mallows_cp function from the olsrr library (which can be installed with the command install.packages("olsrr")):
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_mallows_cp(lp1, lfull)
## [1] 13.69252
As practice with R programming, we can also write our own function.
cpmallows=function(partial, full)
{
  # Computes Mallows' Cp statistic for a partial model against a full model:
  # Cp = SSEpartial/MSEfull - (n - 2*(p+1))
  # SSEpartial: error sum of squares of the partial model
  # MSEfull:    mean squared error of the full model
  # n:          sample size
  # p+1:        number of estimated coefficients of the partial model
  SSEpartial=sum(partial$residuals^2)
  MSEfull=summary(full)$sigma^2
  # The sample size n is the residual degrees of freedom plus the number of
  # coefficients; this is the same for both models.
  n=length(full$coefficients)+full$df.residual
  p1=length(partial$coefficients)
  return(SSEpartial/MSEfull-(n-2*p1))
}
Both functions give the same result:
ols_mallows_cp(lp1, lfull)
## [1] 13.69252
cpmallows(lp1, lfull)
## [1] 13.69252
We know that the partial model is sufficient with respect to the full model if the value of Cp is close to the number of its parameters, p+1; if Cp is much larger, the partial model is not sufficient. Here the number of parameters in lp1 is 4 and Cp = 13.7, so lp1 is not a good model compared to lfull. For the other two models we obtain:
cpmallows(lp2, lfull)
## [1] 7.504144
cpmallows(lp3, lfull)
## [1] 6.400838
For lp2, which has size 5, Cp is still large. However for lp3 which has size 6, the value of Cp=6.4 is sufficiently low to consider lp3 as a good alternative to the full model.
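The rule "Cp close to the number of parameters" admits a sanity check: for the full model compared against itself, Cp equals exactly p+1, since SSE/MSE is the residual degrees of freedom n-(p+1). A self-contained sketch on the built-in mtcars data, with an inline helper cp_of (a hypothetical name, computing the same quantity as the cpmallows function above):

```r
# Cp of a model against itself equals its number of coefficients, p+1.
cp_of = function(partial, full) {
  SSEpartial = sum(residuals(partial)^2)
  MSEfull = summary(full)$sigma^2
  n = nobs(full)
  SSEpartial / MSEfull - (n - 2 * length(coef(partial)))
}
full = lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
cp_of(full, full)        # = 5, the number of coefficients incl. intercept
## [1] 5
```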
Stepwise regression refers to an approach for building regression models successively, using criteria of fit and model size. In R, stepwise regression is performed with the step() function, and the criterion for including or dropping a variable is the value of AIC. Specifically, in the forward method, to decide whether a variable not present in the model should be included, the AIC of the model with that variable added is computed; if it is smaller than the AIC of the current model, the variable is a candidate for inclusion. The variable that brings the largest reduction in AIC is selected to enter the model. A similar logic is followed in the backward method for selecting a variable to drop, and in the stepwise method, which combines the two approaches.
For the step function we must specify the model to start from, the scope of the search (using the option scope with the formula of the full model, i.e. scope=formula(lfull)), and the method (using the option direction with values "forward", "backward" or "both").
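In scripts one often wants only the final model, without the per-step log printed below; the trace = 0 argument of step() suppresses it. A minimal sketch, using the built-in mtcars data rather than the prostate set:

```r
# Forward search with the per-step output suppressed via trace = 0.
empty = lm(mpg ~ 1, data = mtcars)
full  = lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
fwd = step(empty, scope = formula(full), direction = "forward", trace = 0)
formula(fwd)    # formula of the selected final model
```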
In our example we start with an empty model containing only the intercept:
lp0=lm(lpsa~1, data=prostate)
To perform a forward method:
lforward=step(lp0, scope=formula(lfull), direction = "forward")
## Start: AIC=28.84
## lpsa ~ 1
##
## Df Sum of Sq RSS AIC
## + lcavol 1 69.003 58.915 -44.366
## + svi 1 41.011 86.907 -6.658
## + lcp 1 38.528 89.389 -3.926
## + gleason 3 30.548 97.370 8.369
## + lweight 1 24.019 103.899 10.665
## + pgg45 1 22.814 105.103 11.783
## + lbph 1 4.136 123.782 27.650
## + age 1 3.679 124.239 28.007
## <none> 127.918 28.838
##
## Step: AIC=-44.37
## lpsa ~ lcavol
##
## Df Sum of Sq RSS AIC
## + lweight 1 7.1726 51.742 -54.958
## + svi 1 5.2375 53.677 -51.397
## + lbph 1 3.2658 55.649 -47.898
## + pgg45 1 1.6980 57.217 -45.203
## <none> 58.915 -44.366
## + lcp 1 0.6562 58.259 -43.452
## + gleason 3 2.8395 56.075 -43.158
## + age 1 0.0025 58.912 -42.370
##
## Step: AIC=-54.96
## lpsa ~ lcavol + lweight
##
## Df Sum of Sq RSS AIC
## + svi 1 5.1737 46.568 -63.177
## + pgg45 1 1.8158 49.926 -56.424
## <none> 51.742 -54.958
## + lcp 1 0.8187 50.923 -54.506
## + gleason 3 2.7992 48.943 -54.353
## + age 1 0.6456 51.097 -54.176
## + lbph 1 0.4440 51.298 -53.794
##
## Step: AIC=-63.18
## lpsa ~ lcavol + lweight + svi
##
## Df Sum of Sq RSS AIC
## + lbph 1 0.97296 45.595 -63.226
## <none> 46.568 -63.177
## + age 1 0.62301 45.945 -62.484
## + pgg45 1 0.50069 46.068 -62.226
## + lcp 1 0.06937 46.499 -61.322
## + gleason 3 1.64140 44.927 -60.658
##
## Step: AIC=-63.23
## lpsa ~ lcavol + lweight + svi + lbph
##
## Df Sum of Sq RSS AIC
## + age 1 1.15879 44.437 -63.723
## <none> 45.595 -63.226
## + pgg45 1 0.33173 45.264 -61.934
## + lcp 1 0.10115 45.494 -61.441
## + gleason 3 1.26345 44.332 -59.951
##
## Step: AIC=-63.72
## lpsa ~ lcavol + lweight + svi + lbph + age
##
## Df Sum of Sq RSS AIC
## <none> 44.437 -63.723
## + pgg45 1 0.66071 43.776 -63.176
## + lcp 1 0.13040 44.306 -62.008
## + gleason 3 1.61580 42.821 -61.315
The step function returns the final model, reached when no additional variable lowers the AIC.
summary(lforward)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + svi + lbph + age, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86717 -0.37725 0.01257 0.40252 1.44544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.49473 0.87652 0.564 0.57385
## lcavol 0.54400 0.07463 7.289 1.11e-10 ***
## lweight 0.58821 0.19790 2.972 0.00378 **
## svi1 0.71490 0.20653 3.461 0.00082 ***
## lbph 0.10122 0.05759 1.758 0.08215 .
## age -0.01644 0.01068 -1.540 0.12692
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6988 on 91 degrees of freedom
## Multiple R-squared: 0.6526, Adjusted R-squared: 0.6335
## F-statistic: 34.19 on 5 and 91 DF, p-value: < 2.2e-16
The result is the final model of the forward method. Here it is the same as model lp3. We can see the sequence of models the method has derived using the anova field of the result:
lforward$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 96 127.91766 28.83755
## 2 + lcavol -1 69.0028748 95 58.91478 -44.36603
## 3 + lweight -1 7.1726081 94 51.74218 -54.95846
## 4 + svi -1 5.1737396 93 46.56844 -63.17744
## 5 + lbph -1 0.9729642 92 45.59547 -63.22555
## 6 + age -1 1.1587904 91 44.43668 -63.72263
To use the backward method we change the starting model to lfull and the direction to "backward":
lbackward=step(lfull, scope=formula(lfull), direction = "backward")
## Start: AIC=-59.63
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45
##
## Df Sum of Sq RSS AIC
## - gleason 3 1.2966 43.108 -62.668
## - pgg45 1 0.5469 42.358 -60.370
## - lcp 1 0.8442 42.655 -59.692
## <none> 41.811 -59.631
## - lbph 1 1.1919 43.003 -58.904
## - age 1 2.0386 43.850 -57.013
## - svi 1 4.5045 46.315 -51.706
## - lweight 1 4.7080 46.519 -51.281
## - lcavol 1 18.0908 59.902 -26.755
##
## Step: AIC=-62.67
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45
##
## Df Sum of Sq RSS AIC
## - lcp 1 0.6684 43.776 -63.176
## <none> 43.108 -62.668
## - pgg45 1 1.1987 44.306 -62.008
## - lbph 1 1.3844 44.492 -61.602
## - age 1 1.7579 44.865 -60.791
## - lweight 1 4.6429 47.751 -54.746
## - svi 1 4.8333 47.941 -54.360
## - lcavol 1 21.3191 64.427 -25.691
##
## Step: AIC=-63.18
## lpsa ~ lcavol + lweight + age + lbph + svi + pgg45
##
## Df Sum of Sq RSS AIC
## - pgg45 1 0.6607 44.437 -63.723
## <none> 43.776 -63.176
## - lbph 1 1.3329 45.109 -62.266
## - age 1 1.4878 45.264 -61.934
## - svi 1 4.1766 47.953 -56.336
## - lweight 1 4.6553 48.431 -55.373
## - lcavol 1 22.7555 66.531 -24.572
##
## Step: AIC=-63.72
## lpsa ~ lcavol + lweight + age + lbph + svi
##
## Df Sum of Sq RSS AIC
## <none> 44.437 -63.723
## - age 1 1.1588 45.595 -63.226
## - lbph 1 1.5087 45.945 -62.484
## - lweight 1 4.3140 48.751 -56.735
## - svi 1 5.8509 50.288 -53.724
## - lcavol 1 25.9427 70.379 -21.119
summary(lbackward)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86717 -0.37725 0.01257 0.40252 1.44544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.49473 0.87652 0.564 0.57385
## lcavol 0.54400 0.07463 7.289 1.11e-10 ***
## lweight 0.58821 0.19790 2.972 0.00378 **
## age -0.01644 0.01068 -1.540 0.12692
## lbph 0.10122 0.05759 1.758 0.08215 .
## svi1 0.71490 0.20653 3.461 0.00082 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6988 on 91 degrees of freedom
## Multiple R-squared: 0.6526, Adjusted R-squared: 0.6335
## F-statistic: 34.19 on 5 and 91 DF, p-value: < 2.2e-16
lbackward$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 86 41.81094 -59.63064
## 2 - gleason 3 1.2966215 89 43.10756 -62.66823
## 3 - lcp 1 0.6684160 90 43.77597 -63.17571
## 4 - pgg45 1 0.6607078 91 44.43668 -63.72263
For the stepwise method we start again from lp0 and change the direction to "both":
lstep=step(lp0, scope=formula(lfull), direction = "both")
## Start: AIC=28.84
## lpsa ~ 1
##
## Df Sum of Sq RSS AIC
## + lcavol 1 69.003 58.915 -44.366
## + svi 1 41.011 86.907 -6.658
## + lcp 1 38.528 89.389 -3.926
## + gleason 3 30.548 97.370 8.369
## + lweight 1 24.019 103.899 10.665
## + pgg45 1 22.814 105.103 11.783
## + lbph 1 4.136 123.782 27.650
## + age 1 3.679 124.239 28.007
## <none> 127.918 28.838
##
## Step: AIC=-44.37
## lpsa ~ lcavol
##
## Df Sum of Sq RSS AIC
## + lweight 1 7.173 51.742 -54.958
## + svi 1 5.237 53.677 -51.397
## + lbph 1 3.266 55.649 -47.898
## + pgg45 1 1.698 57.217 -45.203
## <none> 58.915 -44.366
## + lcp 1 0.656 58.259 -43.452
## + gleason 3 2.840 56.075 -43.158
## + age 1 0.003 58.912 -42.370
## - lcavol 1 69.003 127.918 28.838
##
## Step: AIC=-54.96
## lpsa ~ lcavol + lweight
##
## Df Sum of Sq RSS AIC
## + svi 1 5.174 46.568 -63.177
## + pgg45 1 1.816 49.926 -56.424
## <none> 51.742 -54.958
## + lcp 1 0.819 50.923 -54.506
## + gleason 3 2.799 48.943 -54.353
## + age 1 0.646 51.097 -54.176
## + lbph 1 0.444 51.298 -53.794
## - lweight 1 7.173 58.915 -44.366
## - lcavol 1 52.157 103.899 10.665
##
## Step: AIC=-63.18
## lpsa ~ lcavol + lweight + svi
##
## Df Sum of Sq RSS AIC
## + lbph 1 0.9730 45.595 -63.226
## <none> 46.568 -63.177
## + age 1 0.6230 45.945 -62.484
## + pgg45 1 0.5007 46.068 -62.226
## + lcp 1 0.0694 46.499 -61.322
## + gleason 3 1.6414 44.927 -60.658
## - svi 1 5.1737 51.742 -54.958
## - lweight 1 7.1089 53.677 -51.397
## - lcavol 1 24.7058 71.274 -23.893
##
## Step: AIC=-63.23
## lpsa ~ lcavol + lweight + svi + lbph
##
## Df Sum of Sq RSS AIC
## + age 1 1.1588 44.437 -63.723
## <none> 45.595 -63.226
## - lbph 1 0.9730 46.568 -63.177
## + pgg45 1 0.3317 45.264 -61.934
## + lcp 1 0.1012 45.494 -61.441
## + gleason 3 1.2635 44.332 -59.951
## - lweight 1 3.6907 49.286 -57.675
## - svi 1 5.7027 51.298 -53.794
## - lcavol 1 24.9384 70.534 -22.906
##
## Step: AIC=-63.72
## lpsa ~ lcavol + lweight + svi + lbph + age
##
## Df Sum of Sq RSS AIC
## <none> 44.437 -63.723
## - age 1 1.1588 45.595 -63.226
## + pgg45 1 0.6607 43.776 -63.176
## - lbph 1 1.5087 45.945 -62.484
## + lcp 1 0.1304 44.306 -62.008
## + gleason 3 1.6158 42.821 -61.315
## - lweight 1 4.3140 48.751 -56.735
## - svi 1 5.8509 50.288 -53.724
## - lcavol 1 25.9427 70.379 -21.119
summary(lstep)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + svi + lbph + age, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86717 -0.37725 0.01257 0.40252 1.44544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.49473 0.87652 0.564 0.57385
## lcavol 0.54400 0.07463 7.289 1.11e-10 ***
## lweight 0.58821 0.19790 2.972 0.00378 **
## svi1 0.71490 0.20653 3.461 0.00082 ***
## lbph 0.10122 0.05759 1.758 0.08215 .
## age -0.01644 0.01068 -1.540 0.12692
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6988 on 91 degrees of freedom
## Multiple R-squared: 0.6526, Adjusted R-squared: 0.6335
## F-statistic: 34.19 on 5 and 91 DF, p-value: < 2.2e-16
lstep$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 96 127.91766 28.83755
## 2 + lcavol -1 69.0028748 95 58.91478 -44.36603
## 3 + lweight -1 7.1726081 94 51.74218 -54.95846
## 4 + svi -1 5.1737396 93 46.56844 -63.17744
## 5 + lbph -1 0.9729642 92 45.59547 -63.22555
## 6 + age -1 1.1587904 91 44.43668 -63.72263
In this example all methods have converged to the same final model; however, this is not generally the case. The starting model and the chosen direction can affect the final model that the search reaches.
References

Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition. Springer, New York.