Model

We will use the dataset exercise1:

load("exercise1.Rdata")
names(exercise1)
## [1] "age"    "sex"    "extime" "score"

We fit a main-effects model:

lmain=lm(score~age+sex+extime, data=exercise1)
lsmain=summary(lmain)
lmain
## 
## Call:
## lm(formula = score ~ age + sex + extime, data = exercise1)
## 
## Coefficients:
## (Intercept)          age         sexM       extime  
##   2147.5217     -30.6958      58.9874       0.4429
lsmain
## 
## Call:
## lm(formula = score ~ age + sex + extime, data = exercise1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -525.80 -103.74   -2.96  107.57  573.36 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 2147.5217   596.1802   3.602  0.00109 **
## age          -30.6958     9.4443  -3.250  0.00278 **
## sexM          58.9874    68.3227   0.863  0.39457   
## extime         0.4429     0.4534   0.977  0.33623   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 189.2 on 31 degrees of freedom
## Multiple R-squared:  0.7734, Adjusted R-squared:  0.7515 
## F-statistic: 35.28 on 3 and 31 DF,  p-value: 4.084e-10

We observe that sex and exercise time are not statistically significant in the presence of age, so to keep the analysis simpler we restrict ourselves to the simple regression model of score on age:

l1=lm(score~age, data=exercise1)
ls1=summary(l1)
l1
## 
## Call:
## lm(formula = score ~ age, data = exercise1)
## 
## Coefficients:
## (Intercept)          age  
##     2789.02       -40.34
ls1
## 
## Call:
## lm(formula = score ~ age, data = exercise1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -524.56  -95.55  -28.86   92.96  600.75 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2789.020    178.184   15.65  < 2e-16 ***
## age          -40.336      3.953  -10.20 9.75e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 189 on 33 degrees of freedom
## Multiple R-squared:  0.7593, Adjusted R-squared:  0.752 
## F-statistic: 104.1 on 1 and 33 DF,  p-value: 9.755e-12

Residuals

The residuals \(y_j-\hat{y}_j\) are included in the output, and so are the fitted values \(\hat{y}_j\). A first plot of residuals against fitted values can be created:

plot(l1$residuals~l1$fitted.values)

The residuals are expressed in the same units as the dependent variable and are therefore scale dependent. In addition, they are not independent (they sum to zero) and do not all have the same variance. To alleviate these problems we can use either the studentized or the jackknife residuals. The studentized residuals are computed with the rstandard function and the jackknife residuals with the rstudent function:

rstu=rstandard(l1)
rjacknife=rstudent(l1)
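As a sketch on simulated data (since the computation applies to any lm fit, not just this one), both kinds of residuals can be reproduced by hand from the raw residuals and the leverages:

```r
# Sketch on simulated data: reproduce rstandard/rstudent by hand.
set.seed(1)
n <- 30
x <- runif(n, 20, 60)
y <- 100 - 2 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)

e <- resid(fit)               # raw residuals y_j - yhat_j
h <- hatvalues(fit)           # leverages h_jj
p <- length(coef(fit))        # number of estimated coefficients
s <- sqrt(sum(e^2) / (n - p)) # residual standard error

r_std  <- e / (s * sqrt(1 - h))                           # standardized (rstandard)
r_jack <- r_std * sqrt((n - p - 1) / (n - p - r_std^2))   # jackknife (rstudent)

stopifnot(all.equal(unname(r_std),  unname(rstandard(fit))))
stopifnot(all.equal(unname(r_jack), unname(rstudent(fit))))
```

The jackknife residual rescales each residual by an error-variance estimate that excludes the observation itself, which is why extreme points stand out more clearly.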

We create the plot of jackknife residuals against fitted values:

plot(rjacknife~l1$fitted.values)

These residuals are independent of scale and are more appropriate for diagnostic tests.

Diagnostic tests

From the diagram of the residuals against fitted values we can often detect violations of the regression assumptions. If all assumptions are satisfied, we expect that the diagram will not show any systematic pattern or behavior. Instead all residuals should be randomly distributed around zero with the same variance.

In the previous diagram we do not observe any suspicious behavior, except for some observations that may be outliers or influential points, which we will discuss later. In particular, the variance of the residuals seems to be constant and there is no departure from linearity (i.e., no systematic pattern).

Normality is not easy to verify from the residual plot. We can use either a Q-Q plot or a normality test:

qqnorm(rjacknife)
abline(0,1)

From the Q-Q plot we see that most residuals fall on the diagonal line, except for some extreme points which correspond to outliers. We can also perform a Kolmogorov-Smirnov test for normality. The following tests the jackknife residuals against the cumulative distribution function of the standard normal distribution:

ks.test(rjacknife, "pnorm", 0, 1)
## 
##  Exact one-sample Kolmogorov-Smirnov test
## 
## data:  rjacknife
## D = 0.1203, p-value = 0.6478
## alternative hypothesis: two-sided

Based on the test, the assumption of normality cannot be rejected.

The independence of the residuals refers to the absence of autocorrelation, i.e., correlation between the residuals of successive observations when the observations form a time sequence. Such cases arise in time series, where observations are taken at successive time points and consecutive observations may be correlated (for example, in environmental datasets such as time series of atmospheric pollution and hospital admissions due to respiratory problems). If autocorrelation is present, time-series models must be applied instead of ordinary regression.

To check for autocorrelation of the residuals, we can make a time-series plot of the jackknife residuals, i.e., against the observation index rather than the fitted values:

plot(rjacknife, type='l')

In the absence of autocorrelation, we expect random variations without any evident pattern.

We can also perform a Durbin-Watson test for autocorrelation. This test is part of the lmtest package, which can be installed using the command install.packages("lmtest"):

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
dwtest(l1)
## 
##  Durbin-Watson test
## 
## data:  l1
## DW = 1.8628, p-value = 0.3391
## alternative hypothesis: true autocorrelation is greater than 0

We can see that the Durbin-Watson statistic is close to 2, the value corresponding to no autocorrelation, and based on the p-value the null hypothesis of no autocorrelation cannot be rejected.
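To see what the test detects, here is a hedged sketch on simulated data with autocorrelated errors; the statistic is computed directly from its definition \(DW = \sum_t (e_t - e_{t-1})^2 / \sum_t e_t^2\), using base R only (the data are hypothetical):

```r
# Sketch: a regression with AR(1) errors, and the Durbin-Watson
# statistic computed from its definition (base R only).
set.seed(2)
n <- 200
x <- 1:n
eps <- as.numeric(arima.sim(list(ar = 0.7), n = n))  # positively autocorrelated noise
y <- 5 + 0.3 * x + eps
fit_ar <- lm(y ~ x)

e  <- resid(fit_ar)
dw <- sum(diff(e)^2) / sum(e^2)  # DW = sum (e_t - e_{t-1})^2 / sum e_t^2
dw                                # well below 2: positive autocorrelation
```

For an AR(1) process with coefficient \(\rho\), the statistic is approximately \(2(1-\rho)\), so values well below 2 indicate positive autocorrelation and values near 4 indicate negative autocorrelation.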

Examples of Assumption Violations

In the following examples we fit models to other datasets, each of which violates one of the regression assumptions.

Example 1: Heteroscedasticity

We use the dataframe d1 from file d1.Rdata:

load("d1.Rdata")
names(d1)
## [1] "age" "y"

We fit a single-factor model of y against age and create a residual plot of jacknife residuals against fitted values:

lh=lm(y~age, data=d1)
rjh=rstudent(lh)
plot(rjh~lh$fitted.values)

From the plot it is evident that the variance of the residuals is not constant; in particular, it seems to increase with the value of y. We can perform a Bartlett test for equality of variances, which tests whether different groups of a sample have equal variances. To apply it here, we create a categorical variable g with three levels corresponding to \(\hat{y} \leq 80, 80 <\hat{y} \leq 110\) and \(\hat{y} > 110\). This can be done easily using the cut function:

g=cut(lh$fitted.values, breaks=c(0, 80, 110, 200), labels=c("A", "B", "C"))
g
##   [1] B C A C B B C B B A A C C B C A C B B A C B B C C C B B C C C C B C A C C
##  [38] B A A C B A B B B B A C A B A A A A B C C A C B C C A C C B A C C A A C C
##  [75] A C B C A C C A B B A A A C B B A C A B C C A B B B A C B B C C A B B C B
## [112] C B B A B B C B A B B C C B C C A C B A B B A C C B C A C B B C A A C C A
## [149] C C A C B A C B A B C B A B A C C A B A A A C B C C C B B A C B C C A C B
## [186] B A B C B A C C C C C B C A C
## Levels: A B C

We can split the jackknife residuals vector into three groups according to the value of g and create a boxplot for each group:

rjA=rjh[g=="A"]
rjB=rjh[g=="B"]
rjC=rjh[g=="C"]
boxplot(rjA, rjB, rjC)

We observe that the three groups have different spreads. The Bartlett test formally tests the equality of the three group variances:

bartlett.test(rjh~g)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  rjh by g
## Bartlett's K-squared = 44.208, df = 2, p-value = 2.514e-10

Based on the p-value, the null hypothesis of homogeneity of variances in the groups based on g is rejected.

Since homoscedasticity is rejected, we must see how to rectify it. One way is to transform the dependent variable. Since the variance increases with \(y\), an appropriate transformation is a concave function, which reduces the higher values proportionally more than the smaller ones. We try a square root transformation: we create a new variable \(y_1=\sqrt{y}\) and fit a new model with \(y_1\) as the dependent variable:

d1$y1=(d1$y)^(1/2)
lh1=lm(y1~age, data=d1)
summary(lh1)
## 
## Call:
## lm(formula = y1 ~ age, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.62895 -0.29909 -0.00664  0.27708  1.20980 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.191364   0.172898   18.46   <2e-16 ***
## age         0.151059   0.003813   39.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4598 on 198 degrees of freedom
## Multiple R-squared:  0.888,  Adjusted R-squared:  0.8874 
## F-statistic:  1569 on 1 and 198 DF,  p-value: < 2.2e-16
rjh1=rstudent(lh1)
plot(rjh1~lh1$fitted.values)

The variance seems closer to homogeneous than before, but it is not entirely clear. If we perform a Bartlett test again with two groups, \(\hat{y}_1 \leq 10\) and \(\hat{y}_1 > 10\), we see that the null hypothesis is still rejected:

g1=lh1$fitted.values<=10
bartlett.test(rjh1~g1)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  rjh1 by g1
## Bartlett's K-squared = 14.235, df = 1, p-value = 0.0001614

A more strongly concave transformation may be tried, for example \(y_2=y^{1/10}\):

d1$y2=(d1$y)^(1/10)
lh2=lm(y2~age, data=d1)
summary(lh2)
## 
## Call:
## lm(formula = y2 ~ age, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.05040 -0.01027 -0.00057  0.00847  0.03690 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.3632681  0.0054994   247.9   <2e-16 ***
## age         0.0048638  0.0001213    40.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01463 on 198 degrees of freedom
## Multiple R-squared:  0.8904, Adjusted R-squared:  0.8898 
## F-statistic:  1608 on 1 and 198 DF,  p-value: < 2.2e-16
rjh2=rstudent(lh2)
plot(rjh2~lh2$fitted.values)

g2=lh2$fitted.values<=1.6
bartlett.test(rjh2~g2)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  rjh2 by g2
## Bartlett's K-squared = 3.5905, df = 1, p-value = 0.05811

Now the Bartlett test no longer rejects. It should be kept in mind that using transformations of the dependent variable makes the interpretation of the regression coefficients rather meaningless.
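Rather than choosing the power by trial and error, a Box-Cox analysis can suggest it. A sketch on simulated heteroscedastic data (assumes the MASS package, which ships with R; the data here are hypothetical, constructed so that \(\sqrt{y}\) is linear in age):

```r
# Sketch: a Box-Cox analysis suggests a power transformation lambda
# for the dependent variable (assumes the MASS package).
library(MASS)
set.seed(3)
age <- runif(200, 20, 80)
y <- (3 + 0.15 * age + rnorm(200, sd = 0.5))^2  # sqrt(y) is linear in age
bc <- boxcox(y ~ age, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]  # lambda near 0.5 points to a square root
```

The lambda maximizing the profile log-likelihood indicates the transformation \(y^\lambda\); in practice one rounds to a nearby interpretable power such as 1/2, 0 (log) or -1.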

Other ways to deal with heteroscedasticity are more general estimation methods that take into account the different variances of the observations, such as weighted least squares or robust regression.
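A minimal weighted least squares sketch on simulated data. The weight form \(1/\hat{y}^2\), taken from a first-pass OLS fit, is an assumption, appropriate when the standard deviation grows roughly in proportion to the mean:

```r
# Sketch: weighted least squares via the weights argument of lm.
# The inverse-variance weights 1/yhat^2 are an assumed form, suitable
# when sd(y) is roughly proportional to the mean of y.
set.seed(4)
age <- runif(200, 20, 80)
mu  <- 20 + 3 * age
y   <- mu + rnorm(200, sd = 0.1 * mu)  # sd proportional to the mean
ols <- lm(y ~ age)
w   <- 1 / fitted(ols)^2               # first-pass fit as a variance proxy
wls <- lm(y ~ age, weights = w)
coef(wls)                               # close to the true values 20 and 3
```

Unlike the transformation approach, the coefficients of the weighted fit keep their interpretation on the original scale of y.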

Example 2: Nonlinearity

We use the dataframe d2 from file d2.Rdata:

load("d2.Rdata")
names(d2)
## [1] "age" "y"

We fit a single-factor model of y against age and create a residual plot of jacknife residuals against fitted values:

l21=lm(y~age, data=d2)
summary(l21)
## 
## Call:
## lm(formula = y ~ age, data = d2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -75.009 -18.850   4.966  21.601  45.871 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  55.2139    10.5196   5.249 3.93e-07 ***
## age           0.3972     0.2332   1.703   0.0901 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.43 on 198 degrees of freedom
## Multiple R-squared:  0.01444,    Adjusted R-squared:  0.009463 
## F-statistic: 2.901 on 1 and 198 DF,  p-value: 0.09008
rj21=rstudent(l21)
plot(rj21~l21$fitted.values)

A first observation is that the model is not statistically significant, so it would seem that age and y are not related. However, the residual plot shows a suspicious nonrandom pattern. We create a scatter plot of the variables together with the regression line:

plot(y~age, data=d2)
abline(l21$coefficients)

We see that there is definitely a relationship between age and y; however, it is not linear. If we fit a linear regression model as above, the best fit is an almost horizontal line that fails to capture the true relationship. Such a model is biased, i.e., the estimated regression equation has a systematic error with respect to the true regression function. The scatter plot suggests including nonlinear terms in the model. We try a quadratic term by changing the model formula to poly(age, 2), which specifies a general degree-2 polynomial; poly can fit polynomials of any degree.

l22=lm(y~poly(age, degree=2), data=d2)
summary(l22)
## 
## Call:
## lm(formula = y ~ poly(age, degree = 2), data = d2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.848  -7.264  -0.011   7.942  35.519 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              72.8247     0.7724  94.285  < 2e-16 ***
## poly(age, degree = 2)1   46.7232    10.9232   4.277 2.95e-05 ***
## poly(age, degree = 2)2 -354.2347    10.9232 -32.430  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.92 on 197 degrees of freedom
## Multiple R-squared:  0.8445, Adjusted R-squared:  0.8429 
## F-statistic:   535 on 2 and 197 DF,  p-value: < 2.2e-16

The new model is statistically significant. We again create a scatter plot of the data together with the fitted regression function. Note that in the lines function the data on the horizontal axis must be in increasing order, otherwise the plot does not make sense. We therefore create a new data frame with the original data and the fitted values from model l22 and order it by age:

d22=data.frame(d2$age, d2$y, l22$fitted.values)
names(d22)=c("age","y", "yh")
d22=d22[order(d22$age), ]
plot(y~age, data=d22)
lines(yh~age, data=d22, col="red")

Now the residual plot shows the typical no-pattern behavior, without any evident violation of the regression assumptions:

plot(rstudent(l22)~l22$fitted.values)
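A side note, as a sketch on hypothetical data: by default poly uses orthogonal polynomials, so the reported coefficients are not those of age and age\(^2\) directly; raw = TRUE gives the raw parameterization, and the fitted values are identical either way:

```r
# Sketch: orthogonal (default) vs raw polynomial parameterization.
# The column spaces coincide, so the fitted values are identical.
set.seed(5)
age <- runif(100, 20, 70)
y <- 70 + 0.5 * age - 0.02 * age^2 + rnorm(100, sd = 2)
m_orth <- lm(y ~ poly(age, degree = 2))
m_raw  <- lm(y ~ poly(age, degree = 2, raw = TRUE))
stopifnot(all.equal(fitted(m_orth), fitted(m_raw)))
coef(m_raw)  # directly interpretable intercept, age and age^2 terms
```

The orthogonal basis is numerically more stable and makes the t-test of each degree independent of the lower-order terms, which is why it is the default.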

Example 3: Autocorrelation of Residuals

We use the dataframe d3 from file d3.Rdata:

load("d3.Rdata")
names(d3)
## [1] "x" "y"

We fit a single-factor model of y against x and create a residual plot of jackknife residuals against fitted values:

l3=lm(y~x, data=d3)
summary(l3)
## 
## Call:
## lm(formula = y ~ x, data = d3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9525 -0.8764  0.0794  0.7520  4.2460 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.7576     0.5264   18.54   <2e-16 ***
## x             1.9971     0.0115  173.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.416 on 198 degrees of freedom
## Multiple R-squared:  0.9935, Adjusted R-squared:  0.9934 
## F-statistic: 3.014e+04 on 1 and 198 DF,  p-value: < 2.2e-16
rj3=rstudent(l3)
plot(rj3~l3$fitted.values)

The model is statistically significant and the residual plot does not indicate any obvious problem, so we might assume that everything is fine and proceed with interpretation. However, we are also given the information that the data were collected sequentially, so that successive observations correspond to consecutive time points. In this case we should also check for autocorrelation, i.e., correlation between residuals at nearby time points. A first graphical check is to plot the residuals not against the fitted values but against the observation index, which serves as a surrogate for time:

plot(rj3, type = 'l')

The plot exhibits the common pattern of positive autocorrelation: the residuals behave randomly at a small local scale, but at a larger scale they tend to form runs of consecutive positive or negative values. To verify this, we perform the Durbin-Watson test, also provided by the lmtest package:

dwtest(l3, alternative = "two.sided")
## 
##  Durbin-Watson test
## 
## data:  l3
## DW = 0.70559, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is not 0

Since the p-value is very small and the DW statistic is less than 2, we conclude that there is strong indication of positive autocorrelation of the residuals.

We now use the dataframe d4 from file d4.Rdata:

load("d4.Rdata")
names(d4)
## [1] "x" "y"

We fit a single-factor model of y against x and create a residual plot of jackknife residuals against fitted values:

l4=lm(y~x, data=d4)
summary(l4)
## 
## Call:
## lm(formula = y ~ x, data = d4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4438  -2.2851  -0.2866   2.5181   9.1061 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.11201    1.48666   7.474 2.44e-12 ***
## x            1.96699    0.03249  60.539  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4 on 198 degrees of freedom
## Multiple R-squared:  0.9487, Adjusted R-squared:  0.9485 
## F-statistic:  3665 on 1 and 198 DF,  p-value: < 2.2e-16
rj4=rstudent(l4)
plot(rj4~l4$fitted.values)

We also create a time-series plot of the residuals:

plot(rj4, type='l')

This plot also indicates autocorrelation of the residuals, but now it is negative: every positive residual has a high probability of being followed by a negative one and vice versa, so successive residuals are negatively correlated. The Durbin-Watson test gives:

dwtest(l4, alternative = "two.sided")
## 
##  Durbin-Watson test
## 
## data:  l4
## DW = 3.5199, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is not 0

The null hypothesis of no autocorrelation is again rejected, but since the DW statistic is close to 4 we conclude that the autocorrelation is negative.

Outliers and Influential Observations

We use the dataframe d5 from file d5.Rdata:

load("d5.Rdata")

We fit a single-factor model of y against x:

l5=lm(y~x, data=d5)

In addition to the jackknife residuals, we can compute the leverage values and Cook’s distance for each observation using the functions hatvalues and cooks.distance, respectively.
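As a sketch on simulated data, Cook’s distance can be reproduced from the standardized residuals and leverages via \(D_i = \frac{r_i^2}{p}\cdot\frac{h_i}{1-h_i}\), where \(p\) is the number of estimated coefficients:

```r
# Sketch: Cook's distance by hand, D_i = (r_i^2 / p) * h_i / (1 - h_i),
# with r_i the standardized residual and h_i the leverage.
set.seed(6)
x <- runif(40, 0, 10)
y <- 1 + 2 * x + rnorm(40)
fit <- lm(y ~ x)
r <- rstandard(fit)
h <- hatvalues(fit)
p <- length(coef(fit))
d_manual <- (r^2 / p) * h / (1 - h)
stopifnot(all.equal(unname(d_manual), unname(cooks.distance(fit))))
```

The formula makes clear that an observation is influential only when it combines a large residual with high leverage.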

From the tables of critical values we get that, for \(n=40, k=1, \alpha=0.05\), the threshold beyond which a jackknife residual marks an observation as an outlier is approximately 3.5. For influential observations, the critical value for the leverage is approximately 0.26 and for Cook’s distance approximately \(19/39 \approx 0.49\). We add the corresponding horizontal lines to the plots:

rj5=rstudent(l5)
summary(rj5)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.17761 -0.58864 -0.11320  0.01815  0.44198  4.32924
plot(rj5)
abline(3.5,0, col="red")
abline(-3.5,0, col="red")

lev5=hatvalues(l5)
summary(lev5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02516 0.02959 0.04688 0.05000 0.05963 0.12311
plot(lev5)
abline(0.26, 0, col="red")

cook5=cooks.distance(l5)
summary(cook5)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.723e-05 1.220e-03 6.080e-03 4.738e-02 2.700e-02 6.905e-01
plot(cook5)
abline(0.48,0, col="red")

From the plots we see that observation 39 is an outlier. Based on the leverages, no observation is considered influential; based on Cook’s distance, observation 39 is also influential.
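Some of the thresholds above can also be approximated directly in R. The Bonferroni-adjusted t quantile reproduces the ~3.5 outlier cutoff; the leverage and Cook’s distance rules of thumb shown here (\(2(k+1)/n\) and \(4/n\)) are common alternatives that differ from the table-based values used above:

```r
# Sketch: commonly used diagnostic cutoffs computed directly.
n <- 40; k <- 1                                     # observations, predictors
out_cut  <- qt(1 - 0.05 / (2 * n), df = n - k - 2)  # Bonferroni outlier cutoff, ~3.5
lev_cut  <- 2 * (k + 1) / n                         # leverage rule of thumb, 0.1
cook_cut <- 4 / n                                   # Cook's distance rule of thumb, 0.1
c(out_cut, lev_cut, cook_cut)
```

Rules of thumb like these flag more observations than the conservative table values; flagged points should be inspected rather than discarded automatically.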