Are flexible designs sound? (October 23, 2006)
Carl-Fredrik Burman and Christian Sonesson
(AstraZeneca, Sweden) have recently published a paper in Biometrics
with an eye-catching title "Are flexible designs sound?" (the
references are given at the bottom of this page). The paper discusses
the pros and cons of various methods of analyzing data collected
in adaptive clinical trials.
Several thought leaders in the area of adaptive designs, including
Christopher Jennison, Bruce Turnbull, Michael Proschan, Peter
Bauer and Marianne Frisen, participated in the discussion of the
paper's main themes.
If you would like to share your thoughts
on this topic, we encourage you to contribute to the discussion
thread in the BioPharmNet forum.
A PhRMA-sponsored workshop, "Adaptive Designs:
Opportunities, Challenges and Scope in Drug Development", will be held
in Bethesda, MD on November 13-14, 2006.
Dr. Burman, Statistical Science
Director in Technical and Scientific Development at AstraZeneca,
kindly agreed to answer several questions related to the
paper.
Alex Dmitrienko: Your paper begins with
a somewhat shocking statement that adaptive designs based on a
weighted test violate basic inference principles. Could you please
explain what you meant by this?
Carl-Fredrik Burman: "One patient one vote" has been the standard.
The weighted test violates this: equally informative observations
are weighted unequally. Consequently, the weighted test is not in
accordance with a number of the guiding principles for how data
should be analyzed.
What is the weighted test?
It is often tempting to change a trial design after an interim
analysis. Bauer and Köhne (1994) have shown that such changes
can be done while controlling the type I error, that is the risk
of falsely concluding a treatment effect. Their beautiful idea
is based on weighting together p-values from the two (or more)
stages of the trial.
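The idea of weighting together stage-wise p-values can be sketched in a few lines. The sketch below assumes an inverse-normal combination with a weight fixed before the trial starts, which is one common way of writing a weighted test; it is an illustration only and not necessarily the combination rule used in the Bauer and Köhne paper. The function names and the weight parameter `w1` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def weighted_z(p1: float, p2: float, w1: float) -> float:
    """Combine two one-sided stage-wise p-values into one Z score.

    w1 is the weight given to stage 1 and must be fixed in advance
    (assumption of this sketch); stage 2 gets weight 1 - w1.
    Both p-values must lie strictly between 0 and 1.
    """
    if not 0.0 < w1 < 1.0:
        raise ValueError("w1 must lie strictly between 0 and 1")
    z1 = _STD_NORMAL.inv_cdf(1.0 - p1)  # stage 1 Z score
    z2 = _STD_NORMAL.inv_cdf(1.0 - p2)  # stage 2 Z score
    # Under the null hypothesis z1 and z2 are independent standard
    # normals, so this weighted sum is again standard normal.
    return sqrt(w1) * z1 + sqrt(1.0 - w1) * z2


def weighted_p(p1: float, p2: float, w1: float) -> float:
    """One-sided p-value of the combined Z score."""
    return 1.0 - _STD_NORMAL.cdf(weighted_z(p1, p2, w1))


if __name__ == "__main__":
    # Hypothetical stage-wise p-values with equal prespecified weights.
    print(weighted_p(0.10, 0.02, 0.5))
```

Because the weights are fixed beforehand, the null distribution of the combined statistic does not depend on how the second stage was redesigned, which is what keeps the type I error under control.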
What are flexible designs used for?
Using the weighted test idea, researchers such as Müller and Schäfer
have opened up the possibility to make your design very flexible.
Almost any design modification can be considered in response to
interim data. They say that you could change the study population,
the primary variable, the treatment duration, the doses, etc.
The greatest interest has been in sample size reestimation.
It is stated several times throughout the paper that flexible
designs based on weighted tests are not valid. In general, how
do you define the validity of a statistical approach?
When we say "not valid" we mean "not providing correct statistical
inference". We believe that the results are not convincing to
the scientific community, to regulators, to prescribing physicians
and, ultimately, to the patients in need of improved treatments.
This meaning of the word "valid" is, of course, a bit subjective.
We might disagree on what is convincing. However, I think that
we have demonstrated that the weighted test can give ridiculous
results in some situations.
Could you give an example of such a "ridiculous" result?
Yes; say that you have a very severe disease. You learn that a
new drug "statistically significantly" reduces mortality compared
to standard treatment, without any severe increase in side effects.
Then I guess that you would be quite keen on switching to the
new treatment. However, then you happen to see the actual data
that gave the "significant" result, according to the weighted
test. In the randomized clinical trial, with equally many patients
on both treatments, in fact more patients died when taking the
new drug. Thus, the mortality is higher but the statistical test
says it's lower. This is completely nonsensical … and it cannot
be explained by covariates or something like that. It is explained
by the counterintuitive unequal weighting of the data before and
after a modification of the sample size. This example is similar
to Example 2 in our paper. It is a rather extreme situation but
I think that the example is enough to conclude that the weighted
test should not be used in an unrestricted way.
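A small numerical sketch shows how such a discrepancy can arise. All numbers below are hypothetical and are not taken from the paper; the sketch assumes a normally distributed treatment difference with known standard deviation 1 (positive values favour the new drug) rather than mortality counts, and an inverse-normal weighted test whose weights are fixed by the originally planned stage sizes.

```python
from math import sqrt
from statistics import NormalDist


def stage_z(mean: float, n: int, sigma: float = 1.0) -> float:
    """Z score for an observed mean difference based on n observations."""
    return mean * sqrt(n) / sigma


# Hypothetical plan: 100 observations in stage 1 and 900 in stage 2,
# so stage 1 carries weight 0.1 in the weighted test.
planned_n1, planned_n2 = 100, 900
w1 = planned_n1 / (planned_n1 + planned_n2)

# Hypothetical outcome: stage 1 looks slightly harmful, and stage 2 is
# cut down to 10 observations that happen to look very good.
n1, mean1 = 100, -0.10
n2, mean2 = 10, 0.80

z1 = stage_z(mean1, n1)
z2 = stage_z(mean2, n2)

# Weighted test: stage 2 keeps its planned weight 0.9 despite n2 = 10.
weighted = sqrt(w1) * z1 + sqrt(1.0 - w1) * z2

# Unweighted ("one patient one vote") summary of the same data.
pooled_mean = (n1 * mean1 + n2 * mean2) / (n1 + n2)
naive = stage_z(pooled_mean, n1 + n2)

critical = NormalDist().inv_cdf(0.975)  # one-sided 2.5% level

print(f"weighted Z = {weighted:.3f}, critical value = {critical:.3f}")
print(f"pooled mean = {pooled_mean:.4f}, naive Z = {naive:.3f}")
```

With these made-up numbers the weighted Z score is roughly 2.08, above the critical value of about 1.96, while the pooled mean difference is slightly negative (about -0.018). The ten late observations count for nine tenths of the weighted statistic, which is the unequal weighting described above.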
In your Example 2, wouldn't one simply conclude that inferences
at the end of a flexible study may differ from inferences performed
in a fixed-design study with the same sample size? This is also
the case in group-sequential trials (for instance, sequential
test statistics follow a completely different distribution from
fixed-design statistics).
It is true that how to analyze data depends on the stopping rule.
However, the standard analysis of group-sequential designs gives
much more sensible results than the weighted test.
Are you concerned that the paper may send the wrong message,
for example, that essentially there is no statistically sound
approach to the analysis of data collected in a study based on
the flexible design?
I'm afraid that people will lose confidence in all adaptive designs
if they see examples of poor designs that are called adaptive.
PhRMA's working group on adaptive designs has emphasized that
these designs should not undermine the validity and integrity
of the trial. It is therefore important that we, all proponents
of adaptive designs, work together to define criteria for what
is a good adaptive design and a valid analysis of the data that
it generates.
Your main criticism of the weighted method in flexible trials
is that it violates the sufficiency principle. One can argue that
this is just a theoretical consideration which may not have much
impact on the use of weighted tests in practice. For example,
it is known that complete statistics are not generally available
in a standard group sequential setting. However, group sequential
clinical trials are still very popular.
Well, I would rather say that the main criticism is that a weighted
inference may lead to illogical conclusions. The inference principles
should be used as guidance rather than as firm rules. When we
saw that some of them were violated, we could move on to construct
compelling examples (such as Example 2) of the problems. The sufficiency
principle can be formulated in precise mathematical terms but
is essentially saying that the inference should be built on the
relevant information. There may be reasons (simplicity, robustness)
not to obey this principle. However, in these cases the inference
is typically based on an approximately sufficient statistic or
on a sufficient statistic for another and wider model. Statistical
tests following group sequential designs, for example, are normally
in accordance with the sufficiency principle.
You described the so-called dual test in Section 4 (reject
the null hypothesis of no treatment effect only if both the flexible
and the fixed-design test statistics are significant). Is this the most
reliable option in flexible designs?
The weighted test idea is very clever. We should look for possibilities
to develop this idea further and hopefully get a less controversial
inference. Since I'm most concerned with the potential discrepancy
between the weighted test and a "naive" estimate, I think that
the dual test is promising. By looking also at the unweighted
Z score, based only on the unweighted statistic and the sample
size, we can safeguard against significant results which are not
supported by unweighted data. There are some issues around the
dual test but as a patient I would find that analysis much more
convincing than a weighted test alone.
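As a rough sketch, the dual test amounts to requiring that the smaller of the two Z scores exceeds the critical value. The code below assumes normally distributed data with known variance, stage-wise Z scores `z1` and `z2`, actual stage sizes `n1` and `n2`, and a prespecified stage 1 weight `w1`; the function name and signature are hypothetical, and the details may differ from the formulation in Section 4 of the paper.

```python
from math import sqrt
from statistics import NormalDist


def dual_test(z1: float, z2: float, n1: int, n2: int,
              w1: float, alpha: float = 0.025) -> bool:
    """Reject only if both the weighted and the unweighted Z score
    exceed the one-sided critical value at level alpha."""
    # Weighted statistic: weights fixed in advance (w1 for stage 1).
    weighted = sqrt(w1) * z1 + sqrt(1.0 - w1) * z2
    # Unweighted statistic: each observation counts equally, so the
    # stages are weighted by their actual sample sizes.
    naive = (sqrt(n1) * z1 + sqrt(n2) * z2) / sqrt(n1 + n2)
    critical = NormalDist().inv_cdf(1.0 - alpha)
    return min(weighted, naive) > critical


if __name__ == "__main__":
    # Hypothetical trial in which stage 2 was cut from 900 to 10
    # observations: the weighted Z is above 1.96 but the naive Z is
    # negative, so the dual test does not reject.
    print(dual_test(z1=-1.0, z2=2.5298, n1=100, n2=10, w1=0.1))
    # Hypothetical trial run as planned: both statistics coincide.
    print(dual_test(z1=2.0, z2=2.0, n1=100, n2=100, w1=0.5))
```

Since rejection requires the weighted test to be significant as well, the dual test can never reject more often than the weighted test alone, so type I error control carries over.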
Is the dual test much less efficient than the weighted test?
A striking observation is that the dual test is often as powerful
as the weighted one. When you see the interim data, you may choose
to modify the sample size in such a way that the naive test is
automatically significant whenever the weighted test is.
You mentioned issues around the dual test?
Yes, for example the dual test also violates the sufficiency principle.
However, I think that this violation is less severe as a poor
drug will not get a counterintuitive significance. The risk is
rather that data may look convincing but that the weighted test
fails to be significant. Another problem is that the property
of being as powerful as the weighted test does not hold for all
significance levels simultaneously. There's also an interesting
example in Section 6 of the rejoinder. It combines problems with
the Bayesian approach and the weighted test to show how a significant
trial with large sample size can be constructed even if the treatment
effect is zero. The trick here is that many trials are started
but only one reported. This shows the importance of transparency.
We should communicate all trials and provide the results in sufficient
detail.
What estimation methods, including both a point estimate and
associated confidence interval, should be used after the dual
test has been carried out?
The simple standard estimates from a study with fixed sample size
do not carry over to group-sequential and adaptive designs. The
dual test has been studied very little. I would guess that the
correspondence theorem could give reasonable confidence intervals.
The same method would give a "median unbiased" point estimator,
which could be seen as a 0% symmetric confidence interval. These
are technical details but I don't think that it should be a severe
obstacle to using the dual test instead of the weighted.
Are there alternatives to the dual test?
Yes. Importantly, Jennison and Turnbull argue that group-sequential
designs are usually more efficient than flexible designs, although
the extra flexibility may sometimes be needed. Also, if a sample
size reestimation rule is prespecified, there exist tests which
are in accordance with the sufficiency principle. For learning
trials in phase I-II, I would personally consider a Bayesian or
Decision Analytic way of looking at the data rather than a flexible
trial in the frequentist framework. It would also be possible
to restrict how a trial may be redesigned in order to decrease
the potential problems with a weighted inference.
You mentioned in the Conclusions section that the problems
identified in your paper do not arise if a Bayesian approach to
the analysis of flexible trials is adopted. This is true at a
conceptual level but doesn't this create new problems, for example,
problems related to the control of the Type I error probability
which is not explicitly defined in a Bayesian framework?
The frequentist and Bayesian schools face different challenges.
The discussion about the weighted test highlights a number of
issues for frequentist statistics. In particular, the specification
of the so-called "sample space" is critical. As an example, it
normally does matter how the investigator would have changed the
design if interim data had been different to what it was. This
is not very intuitive! The Bayesian approach has other problems
such as the need of specifying a prior distribution for the parameters.
In many cases, the two approaches give similar results but there
are examples where they differ markedly. It is often a good idea
to check both Bayesian and frequentist properties of a procedure.
In a confirmatory trial, I would like a guarantee that the type
I error is about 5%. In earlier phases, I would be more open to
a Bayesian (decision theoretic) analysis.
In general, what motivated you to write this paper? Most of
the points you made came from other papers (for example, the use
of unrestricted weighted methods may produce spurious results).
Were you planning to provide a review of issues arising in this
area to point out the pros and cons of the existing flexible design
approaches?
I have heard that the key problems were discussed very early.
In the Vienna group, they may already have been known before
the groundbreaking Bauer and Köhne paper was published. The second
paper in the area (Proschan and Hunsberger, 1995) mentions the
kind of problem that we highlight in Example 2. However, very
little attention has been given to these problems in the literature,
although the scientific output in the area is vast. We also
see how flexible designs get into clinical practice and that the
regulatory interest in adaptive designs is large and increasing.
This year, EMEA has issued a reflection paper on adaptive designs,
FDA has announced that adaptive design guidelines will be written,
and the upcoming PhRMA workshop on adaptive designs (November
13-14, 2006) is attracting great interest. We therefore felt
that it was critical to clearly see the problems with the weighted
test and seek consensus regarding which requirements should
be placed on adaptive designs. At the end of the Biometrics rejoinder,
we suggest concrete rules concerning the transparency of the results
of a trial.
References
Burman, C.F., Sonesson, C. (2006). Are flexible
designs sound? Biometrics. 62, 664–678.
Bauer, P., Köhne, K. (1994). Evaluation of experiments with adaptive
interim analyses. Biometrics. 50, 1029–1041.
Jennison, C., Turnbull, B.W. (2006). Adaptive and nonadaptive
group sequential tests. Biometrika. 93, 1–21.
Proschan, M.A., Hunsberger, S.A. (1995). Designed extension of
studies based on conditional power. Biometrics. 51, 1315–1324.