We will use the prostate cancer data set from the book Hastie et al. (2009), which examines how the PSA score (the response variable) of prostate cancer patients relates to a number of possible factors.
load("prostate.Rdata")
names(prostate)
## [1] "X" "lcavol" "lweight" "age" "lbph" "svi" "lcp"
## [8] "gleason" "pgg45" "lpsa"
The variable descriptions are as follows: lpsa: log PSA score lcavol: log cancer volume lweight: log prostate weight age: age of patient lbph: log of the amount of benign prostatic hyperplasia svi: seminal vesicle invasion lcp: log of capsular penetration gleason: Gleason score pgg45: percent of Gleason scores 4 or 5
We treat svi and gleason as categorical variables with 2 and 4 levels, respectively.
prostate$svi=as.factor(prostate$svi)
prostate$gleason=as.factor(prostate$gleason)
We fit a main effects model with all variables. We consider this as the full model.
lfull=lm(lpsa~lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45, data=prostate)
lfulls=summary(lfull)
lfulls
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp +
## gleason + pgg45, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77956 -0.31691 -0.04774 0.44278 1.52832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.486274 0.885929 0.549 0.58451
## lcavol 0.549532 0.090087 6.100 2.94e-08 ***
## lweight 0.623816 0.200463 3.112 0.00252 **
## age -0.023103 0.011282 -2.048 0.04364 *
## lbph 0.091523 0.058454 1.566 0.12108
## svi1 0.744537 0.244602 3.044 0.00310 **
## lcp -0.124612 0.094565 -1.318 0.19109
## gleason7 0.253874 0.216141 1.175 0.24341
## gleason8 0.480520 0.760895 0.632 0.52938
## gleason9 -0.036003 0.494866 -0.073 0.94217
## pgg45 0.004902 0.004622 1.061 0.29184
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6973 on 86 degrees of freedom
## Multiple R-squared: 0.6731, Adjusted R-squared: 0.6351
## F-statistic: 17.71 on 10 and 86 DF, p-value: < 2.2e-16
We also consider three partial models: (i) model lp1 includes the variables lcavol, lweight, pgg45; (ii) model lp2 includes lcavol, lweight, age, svi; (iii) model lp3 includes lcavol, lweight, age, svi, lbph.
lp1=lm(lpsa~lcavol+lweight+pgg45, data=prostate)
lp1s=summary(lp1)
lp2=lm(lpsa~lcavol+lweight+age+svi, data=prostate)
lp2s=summary(lp2)
lp3=lm(lpsa~lcavol+lweight+age+svi+lbph, data=prostate)
lp3s=summary(lp3)
We find the values of \(R^2\) and adjusted \(R^2\) for each model, including the full:
c(lfulls$r.squared, lfulls$adj.r.squared)
## [1] 0.6731418 0.6351350
c(lp1s$r.squared, lp1s$adj.r.squared)
## [1] 0.6096989 0.5971086
c(lp2s$r.squared, lp2s$adj.r.squared)
## [1] 0.6408203 0.6252038
c(lp3s$r.squared, lp3s$adj.r.squared)
## [1] 0.6526150 0.6335279
In our example lp1 is nested in lfull, lp2 is nested in lp3, and lp3 is nested in lfull.
Comparisons based on \(R^2\) are only suitable between models of the same size. For nested models they are not meaningful, because the larger model always has the greater value of \(R^2\).
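This property is easy to demonstrate with a small sketch. It uses the built-in mtcars data rather than the prostate set, so it is self-contained: even adding a pure-noise predictor never decreases \(R^2\).

```r
# Sketch on the built-in mtcars data (not the prostate set): adding a
# pure-noise predictor to a linear model can never decrease R-squared.
set.seed(1)
m_small = lm(mpg ~ wt, data = mtcars)
mtcars$noise = rnorm(nrow(mtcars))        # predictor unrelated to the response
m_big = lm(mpg ~ wt + noise, data = mtcars)
summary(m_small)$r.squared <= summary(m_big)$r.squared
## [1] TRUE
```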
Comparisons based on adjusted \(R^2\), which penalizes model size, are meaningful between nested models. In our example, the adjusted \(R^2\) of lp1 is noticeably lower than that of lfull, indicating that lp1 is not sufficient for fitting. A partial F-test between lp1 and lfull gives the same conclusion:
anova(lp1, lfull)
## Analysis of Variance Table
##
## Model 1: lpsa ~ lcavol + lweight + pgg45
## Model 2: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 93 49.926
## 2 86 41.811 7 8.1155 2.3846 0.02812 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis is rejected, i.e., there is significant evidence that at least one of the terms in lfull not included in lp1 has a nonzero coefficient.
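The F-statistic reported by anova() can also be reproduced by hand from the residual sums of squares of the two nested models. A minimal sketch, shown on the built-in mtcars data since the formula is the same for any pair of nested linear models:

```r
# Partial F-test by hand: F = ((RSS_p - RSS_f)/(df_p - df_f)) / (RSS_f/df_f),
# where p and f denote the partial and the full model.
partial = lm(mpg ~ wt, data = mtcars)
full    = lm(mpg ~ wt + hp + qsec, data = mtcars)
RSSp = sum(residuals(partial)^2); dfp = partial$df.residual
RSSf = sum(residuals(full)^2);    dff = full$df.residual
Fstat = ((RSSp - RSSf) / (dfp - dff)) / (RSSf / dff)
all.equal(Fstat, anova(partial, full)$F[2])    # matches the anova() table
## [1] TRUE
```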
The comparison between lp2, lp3 and lfull shows that lfull still has the largest adjusted \(R^2\) of the three; however, the differences between the models are small, and it is questionable whether the full model is meaningfully better. Indeed, the partial F-tests now do not reject the null hypothesis:
anova(lp2, lfull)
## Analysis of Variance Table
##
## Model 1: lpsa ~ lcavol + lweight + age + svi
## Model 2: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 92 45.945
## 2 86 41.811 6 4.1345 1.4174 0.2173
anova(lp3, lfull)
## Analysis of Variance Table
##
## Model 1: lpsa ~ lcavol + lweight + age + svi + lbph
## Model 2: lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 91 44.437
## 2 86 41.811 5 2.6258 1.0802 0.3771
An alternative measure of comparison, applicable even among non-nested models, is AIC (the Akaike Information Criterion). It is also applicable to generalized linear models, since it is based on maximum-likelihood estimation (for linear models we know that the MLE and the LSE of the coefficients coincide). It is defined as \(AIC = 2 k - 2 \log(\hat{L})\), where \(k\) is the number of estimated parameters in the model and \(\hat{L}\) is the maximum value of the likelihood function achieved by ML estimation. Since a better fit means a larger likelihood, the term \(- 2 \log(\hat{L})\) should be small; AIC additionally penalizes the likelihood with the number of parameters of the model, in order to avoid overfitting. Smaller values of AIC indicate a better model.
We can find the AIC value for a model using the AIC function:
AIC(lfull)
## [1] 217.6434
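The definition can be verified directly: for a fitted lm object, logLik() returns the maximized log-likelihood together with the number of estimated parameters \(k\), and \(2k - 2\log(\hat{L})\) reproduces the value of AIC(). A sketch on the built-in mtcars data:

```r
# AIC by hand for a linear model, compared with the built-in AIC() function.
m = lm(mpg ~ wt, data = mtcars)
ll = logLik(m)                 # maximized log-likelihood
k = attr(ll, "df")             # number of estimated parameters (incl. sigma^2)
all.equal(2 * k - 2 * as.numeric(ll), AIC(m))
## [1] TRUE
```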
We can also call the function with many models as arguments, in which case it returns a comparison table with the AIC values for each model:
AIC(lp1, lp2, lp3, lfull)
## df AIC
## lp1 5 220.8505
## lp2 6 214.7902
## lp3 7 213.5514
## lfull 12 217.6434
Note that the degrees of freedom in the above table are not the residual degrees of freedom of the linear model: they count the parameters estimated by maximum likelihood, i.e., the regression coefficients plus one for the error variance (for example, 11 + 1 = 12 for lfull).
We see that based on AIC model lp3 is the most appropriate between the four models considered, since it shows the best combination between fit (likelihood) and complexity (number of parameters).
As a final measure of comparison, appropriate for comparing two nested models, we consider Mallows' Cp statistic. We can compute it using the ols_mallows_cp function from the olsrr library (which can be installed with the command install.packages("olsrr")):
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_mallows_cp(lp1, lfull)
## [1] 13.69252
As practice with R programming, we can also write our own function.
cpmallows=function(partial, full)
{
  # Computes Mallows' Cp statistic for a partial model against a full model:
  # Cp = SSEpartial/MSEfull - (n - 2*(p+1))
  # SSEpartial: error sum of squares of the partial model
  # MSEfull:    mean squared error of the full model
  # n:          sample size
  # p+1:        number of estimated coefficients of the partial model
  SSEpartial=sum(partial$residuals^2)
  MSEfull=summary(full)$sigma^2
  # The sample size n is the residual degrees of freedom plus the number of
  # coefficients; this is the same for both models.
  n=length(full$coefficients)+full$df.residual
  p1=length(partial$coefficients)
  return(SSEpartial/MSEfull-(n-2*p1))
}
Both functions give the same result:
ols_mallows_cp(lp1, lfull)
## [1] 13.69252
cpmallows(lp1, lfull)
## [1] 13.69252
We know that the partial model is sufficient with respect to the full model if the value of Cp is close to the number of its parameters, p+1; if Cp is much larger, the partial model is not sufficient. Here the number of parameters in lp1 is 4 and Cp = 13.7, so lp1 is not a good model compared to lfull. For the other two models we obtain:
cpmallows(lp2, lfull)
## [1] 7.504144
cpmallows(lp3, lfull)
## [1] 6.400838
For lp2, which has size 5, Cp is still large. However for lp3 which has size 6, the value of Cp=6.4 is sufficiently low to consider lp3 as a good alternative to the full model.
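The rule "Cp close to the number of parameters" admits a sanity check: for the full model compared against itself, Cp equals exactly p+1, since SSE/MSE is the residual degrees of freedom n-(p+1). A self-contained sketch on the built-in mtcars data, with an inline helper cp_of (a hypothetical name, computing the same quantity as the cpmallows function above):

```r
# Cp of a model against itself equals its number of coefficients, p+1.
cp_of = function(partial, full) {
  SSEpartial = sum(residuals(partial)^2)
  MSEfull = summary(full)$sigma^2
  n = nobs(full)
  SSEpartial / MSEfull - (n - 2 * length(coef(partial)))
}
full = lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
cp_of(full, full)        # = 5, the number of coefficients incl. intercept
## [1] 5
```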
Stepwise regression refers to an approach for building regression models successively, using criteria of fit and model size. In R, stepwise regression is performed with the step() function, and the criterion for including or dropping a variable is the value of AIC. Specifically, in the forward method, to decide whether a variable not present in the model should be included, the AIC of the model with that variable added is computed; if it is smaller than the AIC of the current model, the variable is a candidate for inclusion. The variable that brings the largest reduction in AIC is selected to enter the model. A similar logic is followed in the backward method for selecting a variable to drop, and in the stepwise method, which combines the two approaches.
For the step function we must specify the model to start from, the scope of the search (using the option scope with the formula of the full model, i.e. scope=formula(lfull)), and the method (using the option direction with values "forward", "backward" or "both").
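In scripts one often wants only the final model, without the per-step log printed below; the trace = 0 argument of step() suppresses it. A minimal sketch, using the built-in mtcars data rather than the prostate set:

```r
# Forward search with the per-step output suppressed via trace = 0.
empty = lm(mpg ~ 1, data = mtcars)
full  = lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
fwd = step(empty, scope = formula(full), direction = "forward", trace = 0)
formula(fwd)    # formula of the selected final model
```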
In our example we start with an empty model containing only the intercept:
lp0=lm(lpsa~1, data=prostate)
To perform a forward method:
lforward=step(lp0, scope=formula(lfull), direction = "forward")
## Start: AIC=28.84
## lpsa ~ 1
##
## Df Sum of Sq RSS AIC
## + lcavol 1 69.003 58.915 -44.366
## + svi 1 41.011 86.907 -6.658
## + lcp 1 38.528 89.389 -3.926
## + gleason 3 30.548 97.370 8.369
## + lweight 1 24.019 103.899 10.665
## + pgg45 1 22.814 105.103 11.783
## + lbph 1 4.136 123.782 27.650
## + age 1 3.679 124.239 28.007
## <none> 127.918 28.838
##
## Step: AIC=-44.37
## lpsa ~ lcavol
##
## Df Sum of Sq RSS AIC
## + lweight 1 7.1726 51.742 -54.958
## + svi 1 5.2375 53.677 -51.397
## + lbph 1 3.2658 55.649 -47.898
## + pgg45 1 1.6980 57.217 -45.203
## <none> 58.915 -44.366
## + lcp 1 0.6562 58.259 -43.452
## + gleason 3 2.8395 56.075 -43.158
## + age 1 0.0025 58.912 -42.370
##
## Step: AIC=-54.96
## lpsa ~ lcavol + lweight
##
## Df Sum of Sq RSS AIC
## + svi 1 5.1737 46.568 -63.177
## + pgg45 1 1.8158 49.926 -56.424
## <none> 51.742 -54.958
## + lcp 1 0.8187 50.923 -54.506
## + gleason 3 2.7992 48.943 -54.353
## + age 1 0.6456 51.097 -54.176
## + lbph 1 0.4440 51.298 -53.794
##
## Step: AIC=-63.18
## lpsa ~ lcavol + lweight + svi
##
## Df Sum of Sq RSS AIC
## + lbph 1 0.97296 45.595 -63.226
## <none> 46.568 -63.177
## + age 1 0.62301 45.945 -62.484
## + pgg45 1 0.50069 46.068 -62.226
## + lcp 1 0.06937 46.499 -61.322
## + gleason 3 1.64140 44.927 -60.658
##
## Step: AIC=-63.23
## lpsa ~ lcavol + lweight + svi + lbph
##
## Df Sum of Sq RSS AIC
## + age 1 1.15879 44.437 -63.723
## <none> 45.595 -63.226
## + pgg45 1 0.33173 45.264 -61.934
## + lcp 1 0.10115 45.494 -61.441
## + gleason 3 1.26345 44.332 -59.951
##
## Step: AIC=-63.72
## lpsa ~ lcavol + lweight + svi + lbph + age
##
## Df Sum of Sq RSS AIC
## <none> 44.437 -63.723
## + pgg45 1 0.66071 43.776 -63.176
## + lcp 1 0.13040 44.306 -62.008
## + gleason 3 1.61580 42.821 -61.315
The step function returns the final model, reached when no additional variable lowers the AIC.
summary(lforward)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + svi + lbph + age, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86717 -0.37725 0.01257 0.40252 1.44544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.49473 0.87652 0.564 0.57385
## lcavol 0.54400 0.07463 7.289 1.11e-10 ***
## lweight 0.58821 0.19790 2.972 0.00378 **
## svi1 0.71490 0.20653 3.461 0.00082 ***
## lbph 0.10122 0.05759 1.758 0.08215 .
## age -0.01644 0.01068 -1.540 0.12692
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6988 on 91 degrees of freedom
## Multiple R-squared: 0.6526, Adjusted R-squared: 0.6335
## F-statistic: 34.19 on 5 and 91 DF, p-value: < 2.2e-16
The result is the final model of the forward method. Here it is the same as model lp3. We can see the sequence of models the method has derived using the anova field of the result:
lforward$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 96 127.91766 28.83755
## 2 + lcavol -1 69.0028748 95 58.91478 -44.36603
## 3 + lweight -1 7.1726081 94 51.74218 -54.95846
## 4 + svi -1 5.1737396 93 46.56844 -63.17744
## 5 + lbph -1 0.9729642 92 45.59547 -63.22555
## 6 + age -1 1.1587904 91 44.43668 -63.72263
To use the backward method we change the starting model to lfull and the direction to "backward":
lbackward=step(lfull, scope=formula(lfull), direction = "backward")
## Start: AIC=-59.63
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45
##
## Df Sum of Sq RSS AIC
## - gleason 3 1.2966 43.108 -62.668
## - pgg45 1 0.5469 42.358 -60.370
## - lcp 1 0.8442 42.655 -59.692
## <none> 41.811 -59.631
## - lbph 1 1.1919 43.003 -58.904
## - age 1 2.0386 43.850 -57.013
## - svi 1 4.5045 46.315 -51.706
## - lweight 1 4.7080 46.519 -51.281
## - lcavol 1 18.0908 59.902 -26.755
##
## Step: AIC=-62.67
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45
##
## Df Sum of Sq RSS AIC
## - lcp 1 0.6684 43.776 -63.176
## <none> 43.108 -62.668
## - pgg45 1 1.1987 44.306 -62.008
## - lbph 1 1.3844 44.492 -61.602
## - age 1 1.7579 44.865 -60.791
## - lweight 1 4.6429 47.751 -54.746
## - svi 1 4.8333 47.941 -54.360
## - lcavol 1 21.3191 64.427 -25.691
##
## Step: AIC=-63.18
## lpsa ~ lcavol + lweight + age + lbph + svi + pgg45
##
## Df Sum of Sq RSS AIC
## - pgg45 1 0.6607 44.437 -63.723
## <none> 43.776 -63.176
## - lbph 1 1.3329 45.109 -62.266
## - age 1 1.4878 45.264 -61.934
## - svi 1 4.1766 47.953 -56.336
## - lweight 1 4.6553 48.431 -55.373
## - lcavol 1 22.7555 66.531 -24.572
##
## Step: AIC=-63.72
## lpsa ~ lcavol + lweight + age + lbph + svi
##
## Df Sum of Sq RSS AIC
## <none> 44.437 -63.723
## - age 1 1.1588 45.595 -63.226
## - lbph 1 1.5087 45.945 -62.484
## - lweight 1 4.3140 48.751 -56.735
## - svi 1 5.8509 50.288 -53.724
## - lcavol 1 25.9427 70.379 -21.119
summary(lbackward)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86717 -0.37725 0.01257 0.40252 1.44544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.49473 0.87652 0.564 0.57385
## lcavol 0.54400 0.07463 7.289 1.11e-10 ***
## lweight 0.58821 0.19790 2.972 0.00378 **
## age -0.01644 0.01068 -1.540 0.12692
## lbph 0.10122 0.05759 1.758 0.08215 .
## svi1 0.71490 0.20653 3.461 0.00082 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6988 on 91 degrees of freedom
## Multiple R-squared: 0.6526, Adjusted R-squared: 0.6335
## F-statistic: 34.19 on 5 and 91 DF, p-value: < 2.2e-16
lbackward$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 86 41.81094 -59.63064
## 2 - gleason 3 1.2966215 89 43.10756 -62.66823
## 3 - lcp 1 0.6684160 90 43.77597 -63.17571
## 4 - pgg45 1 0.6607078 91 44.43668 -63.72263
For the stepwise method we start again from lp0 and change the direction to "both":
lstep=step(lp0, scope=formula(lfull), direction = "both")
## Start: AIC=28.84
## lpsa ~ 1
##
## Df Sum of Sq RSS AIC
## + lcavol 1 69.003 58.915 -44.366
## + svi 1 41.011 86.907 -6.658
## + lcp 1 38.528 89.389 -3.926
## + gleason 3 30.548 97.370 8.369
## + lweight 1 24.019 103.899 10.665
## + pgg45 1 22.814 105.103 11.783
## + lbph 1 4.136 123.782 27.650
## + age 1 3.679 124.239 28.007
## <none> 127.918 28.838
##
## Step: AIC=-44.37
## lpsa ~ lcavol
##
## Df Sum of Sq RSS AIC
## + lweight 1 7.173 51.742 -54.958
## + svi 1 5.237 53.677 -51.397
## + lbph 1 3.266 55.649 -47.898
## + pgg45 1 1.698 57.217 -45.203
## <none> 58.915 -44.366
## + lcp 1 0.656 58.259 -43.452
## + gleason 3 2.840 56.075 -43.158
## + age 1 0.003 58.912 -42.370
## - lcavol 1 69.003 127.918 28.838
##
## Step: AIC=-54.96
## lpsa ~ lcavol + lweight
##
## Df Sum of Sq RSS AIC
## + svi 1 5.174 46.568 -63.177
## + pgg45 1 1.816 49.926 -56.424
## <none> 51.742 -54.958
## + lcp 1 0.819 50.923 -54.506
## + gleason 3 2.799 48.943 -54.353
## + age 1 0.646 51.097 -54.176
## + lbph 1 0.444 51.298 -53.794
## - lweight 1 7.173 58.915 -44.366
## - lcavol 1 52.157 103.899 10.665
##
## Step: AIC=-63.18
## lpsa ~ lcavol + lweight + svi
##
## Df Sum of Sq RSS AIC
## + lbph 1 0.9730 45.595 -63.226
## <none> 46.568 -63.177
## + age 1 0.6230 45.945 -62.484
## + pgg45 1 0.5007 46.068 -62.226
## + lcp 1 0.0694 46.499 -61.322
## + gleason 3 1.6414 44.927 -60.658
## - svi 1 5.1737 51.742 -54.958
## - lweight 1 7.1089 53.677 -51.397
## - lcavol 1 24.7058 71.274 -23.893
##
## Step: AIC=-63.23
## lpsa ~ lcavol + lweight + svi + lbph
##
## Df Sum of Sq RSS AIC
## + age 1 1.1588 44.437 -63.723
## <none> 45.595 -63.226
## - lbph 1 0.9730 46.568 -63.177
## + pgg45 1 0.3317 45.264 -61.934
## + lcp 1 0.1012 45.494 -61.441
## + gleason 3 1.2635 44.332 -59.951
## - lweight 1 3.6907 49.286 -57.675
## - svi 1 5.7027 51.298 -53.794
## - lcavol 1 24.9384 70.534 -22.906
##
## Step: AIC=-63.72
## lpsa ~ lcavol + lweight + svi + lbph + age
##
## Df Sum of Sq RSS AIC
## <none> 44.437 -63.723
## - age 1 1.1588 45.595 -63.226
## + pgg45 1 0.6607 43.776 -63.176
## - lbph 1 1.5087 45.945 -62.484
## + lcp 1 0.1304 44.306 -62.008
## + gleason 3 1.6158 42.821 -61.315
## - lweight 1 4.3140 48.751 -56.735
## - svi 1 5.8509 50.288 -53.724
## - lcavol 1 25.9427 70.379 -21.119
summary(lstep)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + svi + lbph + age, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86717 -0.37725 0.01257 0.40252 1.44544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.49473 0.87652 0.564 0.57385
## lcavol 0.54400 0.07463 7.289 1.11e-10 ***
## lweight 0.58821 0.19790 2.972 0.00378 **
## svi1 0.71490 0.20653 3.461 0.00082 ***
## lbph 0.10122 0.05759 1.758 0.08215 .
## age -0.01644 0.01068 -1.540 0.12692
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6988 on 91 degrees of freedom
## Multiple R-squared: 0.6526, Adjusted R-squared: 0.6335
## F-statistic: 34.19 on 5 and 91 DF, p-value: < 2.2e-16
lstep$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 96 127.91766 28.83755
## 2 + lcavol -1 69.0028748 95 58.91478 -44.36603
## 3 + lweight -1 7.1726081 94 51.74218 -54.95846
## 4 + svi -1 5.1737396 93 46.56844 -63.17744
## 5 + lbph -1 0.9729642 92 45.59547 -63.22555
## 6 + age -1 1.1587904 91 44.43668 -63.72263
In this example all methods have converged to the same final model; however, this is not generally the case. The starting model and the chosen direction can affect the final model that the search reaches.
References

Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition. Springer, New York.