Are flexible designs sound? (October 23, 2006)
Carl-Fredrik Burman and Christian Sonesson (AstraZeneca, Sweden) have recently published a paper in Biometrics with an eye-catching title "Are flexible designs sound?" (the references are given at the bottom of this page). The paper discusses the pros and cons of various methods of analyzing data collected in adaptive clinical trials.
Several thought leaders in the area of adaptive designs, including Christopher Jennison, Bruce Turnbull, Michael Proschan, Peter Bauer and Marianne Frisen, participated in the discussion of the paper's main themes.
If you would like to share your thoughts on this topic, we encourage you to contribute to the discussion thread in the BioPharmNet forum.
A PhRMA-sponsored workshop, "Adaptive Designs: Opportunities, Challenges and Scope in Drug Development", will be held in Bethesda, MD on November 13-14, 2006.
Interview with Dr. Carl-Fredrik Burman (conducted by Alex Dmitrienko)
Dr. Burman, Statistical Science Director in Technical and Scientific Development at AstraZeneca, kindly agreed to answer several questions related to the paper.
Alex Dmitrienko: Your paper begins with a somewhat shocking statement that adaptive designs based on a weighted test violate basic inference principles. Could you please explain what you meant by this?
Carl-Fredrik Burman: "One patient one vote" has been the standard. This is violated by the weighted test, in which equally informative observations are weighted unequally. Consequently, the weighted test is not in accordance with a number of the guiding principles for how data should be analyzed.

What is the weighted test?
It is often tempting to change a trial design after an interim analysis. Bauer and Köhne (1994) have shown that such changes can be made while controlling the type I error, that is, the risk of falsely concluding that there is a treatment effect. Their beautiful idea is based on weighting together p-values from the two (or more) stages of the trial.
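
An illustrative sketch, not part of the interview: one commonly used form of the weighted test converts each stage's one-sided p-value to a z-score and adds the z-scores with prespecified weights whose squares sum to one. This inverse-normal form is an assumption made for illustration; Bauer and Köhne's original paper combined p-values differently (through their product). The function name and the input p-values below are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1


def weighted_combination_p(p1: float, p2: float, w1: float) -> float:
    """Combine two one-sided stage-wise p-values with prespecified weights.

    w1 is the weight given to stage 1 (stage 2 gets 1 - w1). The weights
    must be fixed before the interim data are seen.
    """
    if not 0.0 < w1 < 1.0:
        raise ValueError("w1 must lie strictly between 0 and 1")
    z1 = STD_NORMAL.inv_cdf(1.0 - p1)  # stage-1 z-score
    z2 = STD_NORMAL.inv_cdf(1.0 - p2)  # stage-2 z-score
    z_weighted = (w1 ** 0.5) * z1 + ((1.0 - w1) ** 0.5) * z2
    return 1.0 - STD_NORMAL.cdf(z_weighted)


# Hypothetical stage-wise p-values, combined with equal weights.
combined = weighted_combination_p(p1=0.10, p2=0.02, w1=0.5)
print(f"combined one-sided p-value: {combined:.4f}")
```

Because the weights are fixed in advance and the second-stage p-value is uniform under the null hypothesis whatever the second-stage design, the combined statistic keeps a standard normal null distribution even if the second stage is redesigned.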

What are flexible designs used for?
Using the weighted test idea, researchers such as Müller and Schäfer have opened up the possibility of making a design very flexible. Almost any design modification can be considered in response to interim data. They say that you could change the study population, the primary variable, the treatment duration, the doses, etc. The greatest interest has been in sample size reestimation.
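
An illustrative sketch, not part of the interview: the small simulation below compares false-positive rates when the second-stage sample size depends on the interim result. It assumes normally distributed data with known variance, so the stage-wise z-scores can be simulated directly. The planned stage sizes (100 and 100), the interim threshold of 2.1, and the reestimation rule itself are hypothetical choices made only to show the effect.

```python
import random
from statistics import NormalDist

CRIT = NormalDist().inv_cdf(0.975)  # one-sided 2.5% critical value


def simulate_type_one_error(n_sim: int = 200_000, seed: int = 1) -> tuple[float, float]:
    """Return (naive_rate, weighted_rate) of false rejections under the null."""
    rng = random.Random(seed)
    n1, planned_n2 = 100, 100          # planned stage sizes, hence equal weights
    w1 = n1 / (n1 + planned_n2)        # prespecified weight for stage 1
    naive_hits = weighted_hits = 0
    for _ in range(n_sim):
        z1 = rng.gauss(0.0, 1.0)       # stage-1 z-score under the null
        # Hypothetical data-driven rule: almost stop after a promising interim.
        n2 = 1 if z1 > 2.1 else planned_n2
        z2 = rng.gauss(0.0, 1.0)       # stage-2 z-score is N(0,1) whatever n2 is
        # Naive pooled z-score: every observation counts equally.
        z_naive = (n1 ** 0.5 * z1 + n2 ** 0.5 * z2) / (n1 + n2) ** 0.5
        # Weighted z-score: stage weights stay as planned.
        z_weighted = w1 ** 0.5 * z1 + (1.0 - w1) ** 0.5 * z2
        naive_hits += z_naive > CRIT
        weighted_hits += z_weighted > CRIT
    return naive_hits / n_sim, weighted_hits / n_sim


naive_rate, weighted_rate = simulate_type_one_error()
print(f"naive pooled test: {naive_rate:.4f}")
print(f"weighted test:     {weighted_rate:.4f}")
```

Under this rule the naive pooled test should reject noticeably more often than the nominal 2.5%, while the weighted test should stay close to it. That is the property that makes the weighted test attractive for sample size reestimation.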

It is stated several times throughout the paper that flexible designs based on weighted tests are not valid. In general, how do you define the validity of a statistical approach?
When we say "not valid" we mean "not providing correct statistical inference". We believe that the results are not convincing to the scientific community, to regulators, to prescribing physicians and, ultimately, to the patients in need of improved treatments. This meaning of the word "valid" is, of course, a bit subjective. We might disagree on what is convincing. However, I think that we have demonstrated that the weighted test can give ridiculous results in some situations.

Could you give an example of such a "ridiculous" result?
Yes; say that you have a very severe disease. You learn that a new drug "statistically significantly" reduces mortality compared to standard treatment, without any severe increase in side effects. Then I guess that you would be quite keen on switching to the new treatment. However, you then happen to see the actual data that gave the "significant" result according to the weighted test. In the randomized clinical trial, with equally many patients on both treatments, more patients in fact died when taking the new drug. Thus, the mortality is higher but the statistical test says it's lower. This is completely nonsensical … and it cannot be explained by covariates or something like that. It is explained by the counterintuitive unequal weighting of the data before and after a modification of the sample size. This example is similar to Example 2 in our paper. It is a rather extreme situation but I think that the example is enough to conclude that the weighted test should not be used in an unrestricted way.
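
An illustrative sketch, not part of the interview: the numbers below are hypothetical and are not taken from Example 2 of the paper. They show how a weighted z-score can be significant while the pooled data point in the opposite direction. The setting assumed is a two-stage trial analysed with z-scores (normal data, known variance), with 100 patients per arm before the interim and 900 per arm planned after it, and a second stage that is cut to 9 patients per arm after a poor interim result.

```python
from statistics import NormalDist

CRIT = NormalDist().inv_cdf(0.975)  # one-sided 2.5% critical value

# Hypothetical design: 100 patients per arm before the interim, 900 planned after.
n1, planned_n2 = 100, 900
w1 = n1 / (n1 + planned_n2)         # prespecified stage-1 weight, 0.1

# Hypothetical outcome: the interim looks poor (z1 < 0), the second stage is
# cut to 9 patients per arm, and those few patients happen to do well.
z1, z2, actual_n2 = -1.0, 2.5, 9

# Weighted z-score: stage 2 keeps its planned weight of 0.9.
z_weighted = w1 ** 0.5 * z1 + (1.0 - w1) ** 0.5 * z2
# Naive pooled z-score: each patient counts equally.
z_naive = (n1 ** 0.5 * z1 + actual_n2 ** 0.5 * z2) / (n1 + actual_n2) ** 0.5

print(f"weighted z = {z_weighted:.3f}, significant: {z_weighted > CRIT}")
print(f"naive z    = {z_naive:.3f}")
```

The nine second-stage patients per arm carry 90% of the weight, so the weighted z-score exceeds the critical value even though the naive z-score is negative, meaning the pooled data favour the control.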

In your Example 2, wouldn't one simply conclude that inferences at the end of a flexible study may differ from inferences performed in a fixed-design study with the same sample size? This is also the case in group-sequential trials (for instance, sequential test statistics follow a completely different distribution from fixed-design statistics).
It is true that how to analyze data depends on the stopping rule. However, the standard analysis of group-sequential designs gives much more sensible results than the weighted test.

Are you concerned that the paper may send the wrong message, for example, that there is essentially no statistically sound approach to the analysis of data collected in a study based on a flexible design?
I'm afraid that people will lose confidence in all adaptive designs if they see examples of poor designs that are called adaptive. PhRMA's working group on adaptive designs has emphasized that these designs should not undermine the validity and integrity of the trial. It is therefore important that we, all proponents of adaptive designs, work together to define criteria for what is a good adaptive design and a valid analysis of the data that it generates.

Your main criticism of the weighted method in flexible trials is that it violates the sufficiency principle. One can argue that this is just a theoretical consideration which may not have much impact on the use of weighted tests in practice. For example, it is known that complete statistics are not generally available in a standard group sequential setting. However, group sequential clinical trials are still very popular.
Well, I would rather say that the main criticism is that a weighted inference may lead to illogical conclusions. The inference principles should be used as guidance rather than as firm rules. When we saw that some of them were violated we could move on to construct compelling examples (such as Example 2) of the problems. The sufficiency principle can be formulated in precise mathematical terms but is essentially saying that the inference should be built on the relevant information. There may be reasons (simplicity, robustness) not to obey this principle. However, in these cases the inference is typically based on an approximately sufficient statistic or on a sufficient statistic for another and wider model. Statistical tests following group sequential designs, for example, are normally in accordance with the sufficiency principle.

You described the so-called dual test in Section 4 (reject the null hypothesis of no treatment effect only if both the flexible and the fixed-design test statistics are significant). Is this the most reliable option in flexible designs?
The weighted test idea is very clever. We should look for possibilities to develop this idea further and hopefully get a less controversial inference. Since I'm most concerned with the potential discrepancy between the weighted test and a "naive" estimate, I think that the dual test is promising. By looking also at the unweighted Z score, based only on the unweighted statistic and the sample size, we can safeguard against significant results which are not supported by unweighted data. There are some issues around the dual test but as a patient I would find that analysis much more convincing than a weighted test alone.
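
An illustrative sketch, not part of the interview: the dual test can be written as a rule that rejects only when both the weighted z-score and the naive pooled z-score exceed the critical value. The function name, argument layout, and inputs are hypothetical, and the sketch assumes a two-stage design analysed with z-scores (normal data, known variance).

```python
from statistics import NormalDist


def dual_test(z1: float, z2: float, n1: int, n2: int, w1: float,
              alpha: float = 0.025) -> bool:
    """Reject only if both the weighted and the naive pooled z-scores
    exceed the one-sided critical value.

    n1 and n2 are the stage sizes actually observed; w1 is the stage-1
    weight fixed at the design stage.
    """
    crit = NormalDist().inv_cdf(1.0 - alpha)
    z_weighted = w1 ** 0.5 * z1 + (1.0 - w1) ** 0.5 * z2
    z_naive = (n1 ** 0.5 * z1 + n2 ** 0.5 * z2) / (n1 + n2) ** 0.5
    return min(z_weighted, z_naive) > crit


# Hypothetical inputs: the weighted z-score alone would be significant here,
# but the pooled data point the wrong way, so the dual test does not reject.
print(dual_test(z1=-1.0, z2=2.5, n1=100, n2=9, w1=0.1))
# When the design is left as planned, the two statistics coincide.
print(dual_test(z1=2.0, z2=2.0, n1=100, n2=100, w1=0.5))
```

Since the dual test rejects only in cases where the weighted test also rejects, its type I error cannot exceed that of the weighted test.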

Is the dual test much less efficient than the weighted test?
A striking observation is that the dual test is often as powerful as the weighted one. When you see the interim data, you may choose to modify the sample size in such a way that the naive test is automatically significant whenever the weighted test is.

You mentioned issues around the dual test?
Yes. For example, the dual test also violates the sufficiency principle. However, I think that this violation is less severe as a poor drug will not get a counterintuitive significance. The risk is rather that data may look convincing but that the weighted test fails to be significant. Another problem is that the property of being as powerful as the weighted test does not hold for all significance levels simultaneously. There's also an interesting example in Section 6 of the rejoinder. It combines problems with the Bayesian approach and the weighted test to show how a significant trial with large sample size can be constructed even if the treatment effect is zero. The trick here is that many trials are started but only one reported. This shows the importance of transparency. We should communicate all trials and provide the results in sufficient detail.

What estimation methods, including both a point estimate and an associated confidence interval, should be used after the dual test has been carried out?
The simple standard estimates from a study with fixed sample size do not carry over to group-sequential and adaptive designs. The dual test has been studied very little. I would guess that the correspondence theorem could give reasonable confidence intervals. The same method would give a "median unbiased" point estimator, which could be seen as a 0% symmetric confidence interval. These are technical details but I don't think that they should be a severe obstacle to using the dual test instead of the weighted test.

Are there alternatives other than the dual test?
Yes. Importantly, Jennison and Turnbull argue that group-sequential designs are usually more efficient than flexible designs, although the extra flexibility may sometimes be needed. Also, if a sample size reestimation rule is prespecified, there exist tests which are in accordance with the sufficiency principle. For learning trials in phase I-II, I would personally consider a Bayesian or Decision Analytic way of looking at the data rather than a flexible trial in the frequentist framework. It would also be possible to restrict how a trial may be redesigned in order to decrease the potential problems with a weighted inference.

You mentioned in the Conclusions section that the problems identified in your paper do not arise if a Bayesian approach to the analysis of flexible trials is adopted. This is true at a conceptual level but doesn't this create new problems, for example, problems related to the control of the Type I error probability which is not explicitly defined in a Bayesian framework?
The frequentist and Bayesian schools face different challenges. The discussion about the weighted test highlights a number of issues for frequentist statistics. In particular, the specification of the so-called "sample space" is critical. As an example, it normally does matter how the investigator would have changed the design if the interim data had been different from what they were. This is not very intuitive! The Bayesian approach has other problems, such as the need to specify a prior distribution for the parameters. In many cases, the two approaches give similar results but there are examples where they differ markedly. It is often a good idea to check both Bayesian and frequentist properties of a procedure. In a confirmatory trial, I would like a guarantee that the type I error is about 5%. In earlier phases, I would be more open to a Bayesian (decision theoretic) analysis.

In general, what motivated you to write this paper? Most of the points you made came from other papers (for example, the use of unrestricted weighted methods may produce spurious results). Were you planning to provide a review of issues arising in this area to point out the pros and cons of the existing flexible design approaches?
I have heard that the key problems were discussed very early. In the Vienna group, they might have been known even before the groundbreaking Bauer and Köhne paper was published. The second paper in the area (Proschan and Hunsberger, 1995) mentions the kind of problem that we highlight in Example 2. However, very little attention has been given to these problems in the literature, although the scientific production in the area is vast. We also see how flexible designs are getting into clinical practice and that the regulatory interest in adaptive designs is large and increasing. This year, EMEA has issued a reflection paper on adaptive designs, FDA has announced that adaptive design guidelines will be written, and the upcoming PhRMA workshop on adaptive designs (November 13-14, 2006) is attracting great interest. We therefore felt that it was critical to see the problems with the weighted test clearly and to seek consensus on which requirements should be placed on adaptive designs. At the end of the Biometrics rejoinder, we suggest concrete rules concerning the transparency of the results of a trial.
References
Burman, C.F., Sonesson, C. (2006). Are flexible designs sound? Biometrics. 62, 664–678.
Bauer, P., Köhne, K. (1994). Evaluation of experiments with adaptive interim analyses. Biometrics. 50, 1029–1041.
Jennison, C., Turnbull, B.W. (2006). Adaptive and nonadaptive group sequential tests. Biometrika. 93, 1–21.
Proschan, M.A., Hunsberger, S.A. (1995). Designed extension of studies based on conditional power. Biometrics. 51, 1315–1324.