Introduction

In today’s lab, we will perform inference after the end of a study that involved an interim analysis. Note that inference involves both hypothesis testing and estimation. The former results in tests involving p-values and the latter in point estimates and confidence intervals. Both must account for the fact that the data were examined during multiple (interim) analyses instead of once at the end of the study.

Definition of the p-value in the context of monitoring

The definition of a p-value in the fixed-sample design is the probability of observing a test statistic as extreme or more extreme than what was actually observed.

This is not so clear in the group sequential context. For example, if \(Z_1(\tau_i)>Z_2(\tau_i)\), i.e., the test statistics of two identical studies at the same interim analysis \(\tau_i\), it may be clear that \(Z_1(\tau_i)\) provides more extreme evidence than \(Z_2(\tau_i)\), written formally as \[ (\tau_i, Z_1)\succ (\tau_i, Z_2) \] However, the following is not as clear: Is \(Z(\tau_i)=3.50\) after stage \(\tau_i\) which, say, did not result in the interruption of the study, more or less extreme than \(Z(\tau_j)=3.50\) after stage \(\tau_j>\tau_i\) which, in this hypothetical experiment, resulted in the interruption of the study?

The major assumption that will be made for the following discussion is that the evidence leading to the stopping of the study is more extreme in stage \(j\) than it was in stage \(i<j\).

In other words, we assume that a study that was stopped in stage \(j\) generated more extreme evidence (larger deviations from the null) than was observed in stage \(i\) (where the study was not stopped). This is called “stage-wise ordering” of the sample space (see Proschan, Lan & Wittes, 2006).

It only matters that you reached the \(j\)th stage. How you reached it is irrelevant.

One-sided case

With the previous assumption in mind, the p-value in the one-sided case is \[\begin{eqnarray*} p = \underbrace{\mbox{Pr}\left ( \cup_{i=1}^{j-1}Z(t_i)> c_i\right )} & + & \underbrace{\mbox{Pr}\left ( \cap_{i=1}^{j-1}Z(t_i)\leq c_i, Z(t_j)> z_j\right )}\\ \mbox{Trial stops at some } i<j & \mbox{OR} & \mbox{Trial does not stop before } j\mbox{, stops at } j \end{eqnarray*}\]

where \(z_j\) is the observed z-score at the final stage.
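Because the two events are disjoint, the sum collapses to \(p=1-\mbox{Pr}\left (\cap_{i=1}^{j-1}Z(t_i)\leq c_i, Z(t_j)\leq z_j\right )\). As a numerical sketch (in Python with scipy rather than R, and using the standard joint normal approximation \(\mbox{Corr}(Z(t_i),Z(t_j))=\sqrt{t_i/t_j}\) for \(t_i\leq t_j\)), this probability can be computed directly; the bounds and times below are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def stagewise_p(bounds, z_obs, times):
    """One-sided stage-wise p-value: the trial stopped at the last
    element of `times` with observed z-score `z_obs`, after staying
    below the upper `bounds` at the earlier analyses.
    Assumes Corr(Z(t_i), Z(t_j)) = sqrt(t_i / t_j) for t_i <= t_j."""
    t = np.asarray(times, dtype=float)
    # correlation matrix of (Z(t_1), ..., Z(t_j))
    corr = np.sqrt(np.minimum.outer(t, t) / np.maximum.outer(t, t))
    mvn = multivariate_normal(mean=np.zeros(len(t)), cov=corr)
    # p = Pr(stop at some i < j) + Pr(continue to j, Z(t_j) > z_obs)
    #   = 1 - Pr(Z(t_1) <= c_1, ..., Z(t_{j-1}) <= c_{j-1}, Z(t_j) <= z_obs)
    return 1.0 - mvn.cdf(np.append(bounds, z_obs))

# hypothetical two-stage design: bound 2.18 at t = 0.5, observed z = 2.30 at t = 1
p = stagewise_p([2.18], 2.30, [0.5, 1.0])   # roughly 0.022
```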

Two-sided case

For two-tailed p-values, suppose that the study was stopped at stage \(\tau_j\) and that the observed z-score was \(Z(\tau_j)=z_j\).

If the boundaries are symmetric around zero, then the two-sided p-value is the one-sided p-value applied to \(|Z(t_i)|\) for \(i=1,\cdots, j\) such that \[\begin{eqnarray*} p = \mbox{Pr}\left ( \cup_{i=1}^{j-1}|Z(t_i)|\geq c_i\right ) + \mbox{Pr}\left ( \cap_{i=1}^{j-1}|Z(t_i)|< c_i, |Z(t_j)|\geq z_j\right ) \end{eqnarray*}\]

Otherwise, the two-sided p-value is \[ p=2\min(p_L, p_U) \] where \(p_L\) and \(p_U\) are the one-sided p-values of crossing the lower and the upper boundary, respectively.
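In code this last step is a one-liner; the two one-sided values below are hypothetical, just to illustrate the combination rule.

```python
# two-sided p-value from the lower and upper one-sided crossing p-values
# (hypothetical values for illustration)
p_L, p_U = 0.004, 0.62
p_two_sided = 2 * min(p_L, p_U)   # 0.008
```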

Example (Proschan, Lan & Wittes, 2006)

Suppose that a trial was designed to have one interim and one final analysis with \(\alpha=0.025\) (one-sided). This results in critical points \(c_1=c_2=2.18\).

The R code that finds the bounds is as follows:

library(gsDesign)
# study design
x<-gsDesign(k=2, test.type=1, alpha=0.025, sfu="Pocock")
x
## One-sided group sequential design with
## 90 % power and 2.5 % Type I Error.
##            Sample
##             Size 
##   Analysis Ratio*  Z   Nominal p  Spend
##          1   0.55 2.18    0.0147 0.0147
##          2   1.10 2.18    0.0147 0.0103
##      Total                       0.0250 
## 
## ++ alpha spending:
##  Pocock boundary.
## * Sample size ratio compared to fixed design with no interim
## 
## Boundary crossing probabilities and expected sample size
## assume any cross stops the trial
## 
## Upper boundary (power or Type I Error)
##           Analysis
##    Theta      1      2 Total   E{N}
##   0.0000 0.0147 0.0103 0.025 1.0920
##   3.2415 0.5893 0.3107 0.900 0.7759

Suppose further that, at the second and final analysis, we observed \(Z(\tau_2)=Z(1)=2.30\) (i.e., the study did not stop at the first analysis at \(\tau_1=0.5\) but the null hypothesis was rejected at the second and final analysis at \(\tau_2=1\)).

To find the exit probability we use the package ldbounds as follows:

# One-sided exit p-value
library(ldbounds)
# Timing of the analyses
t<-c(0.5, 1)
# Exit probability
pvalue<-drift(za=c(-8,-8), zb=c(x$upper$bound[1], 2.30), t=t, drft=0)
summary(pvalue)
## 
## Lan-DeMets method for group sequential boundaries 
##  
## n =  2 
## 
## Boundaries: 
##    Time  Lower   Upper
## 1   0.5     -8  2.1783
## 2   1.0     -8  2.3000
## 
## Drift parameters:  0 
##    Time  Lower probs  Upper probs   Exit pr.  Cum exit pr.
## 1   0.5   6.2210e-16    0.0146929  0.0146929      0.014693
## 2   1.0   6.2161e-16    0.0071292  0.0071292      0.021822

The p-value is \[ p=\mbox{Pr}\left (Z(0.5)\geq 2.18\cup Z(1)\geq 2.30\right )=0.0218 \]

A couple of things to note in the above code. First of all, the code

zb=c(x$upper$bound[1], 2.30)
zb
## [1] 2.178272 2.300000

In other words, we took the first upper bound produced earlier by the gsDesign command and appended the observed test statistic from the second and final analysis. Similarly, the code

za=c(-8,-8)
za
## [1] -8 -8

is the way that we make the test one-sided, by assigning virtually zero probability to the lower bound.
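To see just how negligible that lower-tail probability is, a quick check in Python with scipy (note that it matches the lower-boundary probabilities of about 6.22e-16 in the output above):

```python
from scipy.stats import norm

# probability mass below a boundary of -8 is negligible, so the lower
# boundary is effectively removed and the test behaves as one-sided
p_lower = norm.cdf(-8.0)   # about 6.2e-16
```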

The second required argument in the drift command is drft=0. This says that we want to compute the exit probabilities under the null hypothesis, i.e., assuming an effect size (or “drift” in group-sequential parlance) equal to zero.

Confidence intervals

Recall that, without monitoring, we observe \(\hat{\delta}\) and \(z_{\mbox{obs}}=\hat{\delta}/\sqrt{\mbox{var}(\hat{\delta})}\), where \(\hat{\delta}\) is \(\bar{X}_1-\bar{X}_2\), \(p_C-p_T\) or \(\lambda_1/\lambda_2\) in two-sided tests of means, proportions or the hazard ratio in studies of time to an event, and \(z_{\mbox{obs}}\) is the observed z-score for testing the null hypothesis \(H_0:\delta=0\).

The statistical test rejects the null hypothesis if \[ p=\mbox{Pr}\left (|Z|\geq z_{\mbox{obs}}\right )\leq \alpha \] Confidence intervals can be constructed by “inverting” this hypothesis test.

Definition in terms of p values

Inversion of the hypothesis test means that, using the above probability statement, we calculate lower and upper bounds \(\delta_L\) and \(\delta_U\) respectively such that

\[\begin{eqnarray*} \mbox{Pr}(\delta_L\geq \delta)& = & \mbox{Pr}\left ( \frac{\delta_L-\hat{\delta}}{\sqrt{\mbox{var}(\hat{\delta})}}\geq \frac{\delta-\hat{\delta}}{\sqrt{\mbox{var}(\hat{\delta})}}\right ) = \mbox{Pr}\left ( \frac{\hat{\delta}-\delta_L}{\sqrt{\mbox{var}(\hat{\delta})}}\geq z_{\mbox{obs}}\right ) \leq \alpha/2 \end{eqnarray*}\]

and similarly for the upper bound \(\delta_U\). Note that this holds because \(\mbox{Pr}\left ( \frac{\hat{\delta}-\delta_L}{\sqrt{\mbox{var}(\hat{\delta})}}\geq z_{\mbox{obs}}\right )=\mbox{Pr}\left ( \frac{\hat{\delta}-\delta_L}{\sqrt{\mbox{var}(\hat{\delta})}}\leq -z_{\mbox{obs}}\right )\) due to the symmetry of the normal distribution. This is equivalent to using a z-score with \(\delta_L\) as the mean of the distribution so that \[ p_{\delta_L} = \mbox{Pr}(Z>z_{\mbox{obs}}|\delta=\delta_L)\leq \alpha/2 \] Similarly, the upper bound \(\delta_U\) is calculated by considering \[ p_{\delta_U}=\mbox{Pr}(Z< z_{\mbox{obs}}|\delta=\delta_U)\leq \alpha/2 \]
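In the fixed-sample case, this inversion can be carried out with a one-dimensional root search. A minimal sketch in Python with scipy (the estimate \(\hat\delta\) and its standard error below are hypothetical):

```python
from scipy.stats import norm
from scipy.optimize import brentq

def invert_test(delta_hat, sd, alpha=0.05):
    """Fixed-sample confidence limits by inverting the z-test:
    find delta_L with Pr(Z > z_obs | delta = delta_L) = alpha/2 and
    delta_U with Pr(Z < z_obs | delta = delta_U) = alpha/2."""
    z_obs = delta_hat / sd
    # under delta, Z ~ N(delta / sd, 1)
    lower = brentq(lambda d: norm.sf(z_obs - d / sd) - alpha / 2,
                   delta_hat - 10 * sd, delta_hat)
    upper = brentq(lambda d: norm.cdf(z_obs - d / sd) - alpha / 2,
                   delta_hat, delta_hat + 10 * sd)
    return lower, upper

# hypothetical estimate: delta_hat = 1.0 with standard error 0.5;
# the inversion recovers the familiar 1.0 +/- 1.96 * 0.5
lo, hi = invert_test(1.0, 0.5)
```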

Confidence intervals resulting from stage-wise ordering

Confidence intervals calculated after a study which includes interim monitoring should have the following properties:

  1. The confidence interval should be a (contiguous) interval
  2. It should agree with the original test. In other words, if the test rejected \(H_0\), then the value of \(\delta\) under the null should not be contained within the interval.
  3. The confidence interval should contain the MLE \(\hat\delta=Z(t)/\sqrt{\nu_T}\)
  4. A narrower confidence interval is to be preferred to a wider one

All of these properties hold under the stage-wise ordering espoused in these notes.

Example: Diet trial (Proschan, Lan & Wittes, 2006)

In a clinical trial of 200 participants per arm, the primary endpoint was weight change over 3 months. An O’Brien-Fleming spending function was used and four analyses of the data were planned. Initially, we would have planned this trial assuming four equally spaced analyses

originalOFdesign<-gsDesign(k=4, test.type=2, sfu=sfLDOF, alpha=0.025, beta=0.1)
cbind(1:4,originalOFdesign$upper$bound)
##      [,1]     [,2]
## [1,]    1 4.332634
## [2,]    2 2.963131
## [3,]    3 2.359044
## [4,]    4 2.014090

This is not exactly how things turned out. In reality, the first two analyses occurred at times \(\tau_1=0.22\) and \(\tau_2=0.55\). The third analysis occurred after \(n_T=152\) and \(n_C=144\) subjects had been accrued in the treatment and control arms respectively, that is, at (information) time \(\tau_3=0.74\).
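The information time quoted for the third analysis is simply the accrued fraction of the 400 planned participants (assuming information is proportional to the number accrued):

```python
# information fraction at the third look: accrued over planned (200 per arm)
n_T, n_C = 152, 144
tau_3 = (n_T + n_C) / (200 + 200)   # 296 / 400 = 0.74
```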

Under this timing, the O’Brien-Fleming bounds would be

actualOFdesign<-gsDesign(k=4, test.type=2, sfu=sfLDOF, alpha=0.025,
                         beta=0.1, timing=c(0.22, 0.55, 0.74, 1))
cbind(1:4, actualOFdesign$upper$bound)
##      [,1]     [,2]
## [1,]    1 4.637360
## [2,]    2 2.806017
## [3,]    3 2.391246
## [4,]    4 2.012485

In other words,

Table 1. Bounds of the design as actually run.
Analyses Bounds
1 4.637360
2 2.806017
3 2.391246
4 2.012485

The z-score at the third interim analysis was \[ Z(0.74)=\frac{\bar{X}_1-\bar{X}_2}{\sqrt{(4.8)^2(1/152+1/144)}}=3.76 \] reflecting a sample standard deviation \(s=4.8\) and \(\hat{\delta}(\tau_3)=2.099\).
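This arithmetic can be reproduced directly; a quick check (in Python rather than R, with \(s=4.8\) and \(\hat{\delta}(\tau_3)=2.099\) taken from above):

```python
import math

# standard error of the mean difference with unequal group sizes
delta_hat, s, n_T, n_C = 2.099, 4.8, 152, 144
se = math.sqrt(s**2 * (1 / n_T + 1 / n_C))
z = delta_hat / se   # about 3.76
```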

As \(Z(0.74)=3.76>2.39\) the study stops. The cumulative exit probability is calculated as follows:

dietstudyp<-drift(zb=c(actualOFdesign$upper$bound[1:2],3.76), 
      t=c(0.22, 0.55, 0.74), drft=0)
cbind(1:3,dietstudyp$cum.exit)
##      [,1]         [,2]
## [1,]    1 3.528878e-06
## [2,]    2 5.017117e-03
## [3,]    3 5.037087e-03

Thus, the cumulative two-sided (“exit”) probability is \(p=0.005\).

Confidence interval of the effect size

Recall that, in a fixed-sample design comparing two groups, \[ N=\frac{2\sigma^2(z_{1-\alpha}+z_{1-\beta})^2}{\delta^2} \] so that \[ \theta=z_{1-\alpha}+z_{1-\beta}=\frac{\delta}{\sqrt{{\rm var}(\hat{\delta})}}=\frac{\delta}{\sqrt{2\sigma^2/N}} \] where \(\theta\) is the drift (the standardized effect detectable by the design) and \(N\) is the sample size for each group at the completion of the study. In addition, \(\delta\), the effect size (e.g., the difference in average weight loss between the two interventions in the diet example), is equal to \[ \delta=\theta\sqrt{2\sigma^2/N} \] When we have interim analyses, the software can produce an updated \(\theta^*\), which takes into account the interim looks at the data.

In our example, the fixed-sample drift is \[ \theta=z_{1-\alpha}+z_{1-\beta}=1.96+1.282=3.2415 \] The updated drift taking into consideration the interim analyses is

dietstudydrift<-drift(zb=actualOFdesign$upper$bound, 
      t=c(0.22, 0.55, 0.74, 1), pow=0.9)
dietstudydrift$drift
## [1] 3.270813

so that \(\theta^*=3.2708\) is the minimum effect size that will result in rejection of the null hypothesis. The effect size in terms of weight loss difference in the two treatment groups (accounting for the interim analyses) will be \[ \delta^*=\theta^*\sqrt{2\sigma^2/N}=3.2708\sqrt{2(4.8)^2/200}=1.57 \] That is, it would take a difference in the loss of weight of \(\delta^*=1.57\) lbs in order to reject the null hypothesis after three interim analyses.
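Both numbers can be checked in a few lines; a sketch in Python with scipy, taking \(\sigma=4.8\), \(N=200\) and the updated \(\theta^*\) from the example above:

```python
import math
from scipy.stats import norm

# fixed-sample drift for one-sided alpha = 0.025 and 90% power
theta = norm.ppf(0.975) + norm.ppf(0.90)      # about 3.2415

# interim-adjusted drift converted to a difference in average weight
# loss: delta* = theta* * sqrt(2 * sigma^2 / N)
theta_star = 3.2708
delta_star = theta_star * math.sqrt(2 * 4.8**2 / 200)   # about 1.57 lbs
```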

A similar situation arises when we want to derive a two-sided confidence interval for the effect size. The software will produce a confidence interval in terms of \(\theta^*\), \((\theta^*_L, \theta_U^*)\). Then the two-sided confidence interval for the effect size will be \[ (\delta^*_L, \delta_U^*)=\left (\theta^*_L\sqrt{2\sigma^2/N}, \theta_U^*\sqrt{2\sigma^2/N}\right ) \] The R code performing this analysis is

dietstudyCI<-drift(zb=c(actualOFdesign$upper$bound[1:3]), 
      t=c(0.22, 0.55, 0.74), conf=0.95, zval=3.76)
cat("(",dietstudyCI$conf.interval$lower.limit/sqrt(0.74),",",
    dietstudyCI$conf.interval$upper.limit/sqrt(0.74), ")")
## ( 1.134223 , 6.210698 )
Since \(\delta^*=\theta^*\sqrt{2\sigma^2/N}\), the relevant 95% confidence interval in the scale of \(\delta^*\) is \[\begin{eqnarray*} (\delta_L, \delta_U) & = & (\theta_L\sqrt{2\sigma^2/N}, \theta_U\sqrt{2\sigma^2/N})\\ & = & (1.134\sqrt{2(4.8)^2/200}, 6.211\sqrt{2(4.8)^2/200}) = (0.544, 2.981) \end{eqnarray*}\]

This means that the experimental treatment reduces weight by between about half a pound and three pounds.
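The conversion from the \(\theta^*\) scale to the \(\delta^*\) scale is a single multiplication; a quick check, with the limits taken from the drift() output above and \(\sigma=4.8\), \(N=200\):

```python
import math

scale = math.sqrt(2 * 4.8**2 / 200)           # = 0.48
theta_L, theta_U = 1.134223, 6.210698         # theta*-scale limits from drift()
delta_L, delta_U = theta_L * scale, theta_U * scale   # about (0.544, 2.981)
```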

Carrying out analyses in terms of the effect size is a powerful way to perform interim testing, since the same software applies regardless of whether the specific study tested differences in means, differences in proportions, or time-to-event outcomes. All you need is an estimate of the revised drift and the appropriate definition of the effect \(\delta\) and its associated variance \({\rm var}(\delta)\):

  1. Differences in means: \(\delta=\mu_E-\mu_C\) and \({\rm var}(\delta)=2\sigma^2/N\)
  2. Differences in proportions: \(\delta=p_E-p_C\) and \({\rm var}(\delta)=2p(1-p)/N\), where \(p=\frac{p_E+p_C}{2}\)
  3. Time-to-event outcomes: \(\delta=\log\left (\frac{\lambda_E}{\lambda_C}\right )\) and \({\rm var}(\delta)\approx 4/D\), where \(D\) is the total number of events
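Each of these definitions feeds into the drift via \(\theta=\delta/\sqrt{{\rm var}(\delta)}\). A sketch of the three cases (in Python; the function names and the numeric inputs are illustrative, not from the text):

```python
import math

def drift_means(mu_E, mu_C, sigma, N):
    # difference in means: delta = mu_E - mu_C, var(delta) = 2 * sigma^2 / N
    return (mu_E - mu_C) / math.sqrt(2 * sigma**2 / N)

def drift_proportions(p_E, p_C, N):
    # difference in proportions, with pooled p = (p_E + p_C) / 2
    p = (p_E + p_C) / 2
    return (p_E - p_C) / math.sqrt(2 * p * (1 - p) / N)

def drift_survival(lam_E, lam_C, D):
    # log hazard ratio, var(delta) approx 4 / D with D total events
    return math.log(lam_E / lam_C) / math.sqrt(4 / D)

# hypothetical trial: p_E = 0.6 vs p_C = 0.5 with N = 200 per arm
theta = drift_proportions(0.6, 0.5, 200)   # about 2.01
```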

Bibliography

Proschan, M.A., Lan K.K.G. and Wittes J.T. (2006). Statistical Monitoring of Clinical Trials: A unified approach. Springer, New York, NY.